API Reference

Inference

Regression

class gpmap.inference.MinimumEpistasisInterpolator(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', P=2, a=None, cg_rtol=1e-05)

Minimum epistasis interpolation model for sequence-function relationships.

A class for performing Minimum Epistasis Interpolation (MEI) to infer complete genotype-phenotype maps from incomplete and noisy data. This model applies a prior that penalizes local epistatic coefficients of order P and infers the posterior distribution based on experimental data for a subset of sequences.

Parameters:

n_alleles (int, optional): The number of alleles per site. If not provided, it will be inferred from the provided data.
seq_length (int, optional): The length of the genotype sequences. If not provided, it will be inferred from the provided data.
genotypes (array-like, optional): A list or array of genotypes to be used in the interpolation. If not provided, the model will infer the genotype space.
alphabet_type (str, optional): The type of alphabet used for genotypes. Default is “custom”.
P (int, optional): The order of epistasis to consider. Default is 2. This determines the level of interaction between genetic sites that is penalized.
a (float, optional): The regularization parameter. If not provided, it will be inferred during the fitting process to best match the observed data.
cg_rtol (float, optional): The relative tolerance for the conjugate gradient solver. Default is 1e-5. This controls the precision of the solver used in computations.
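
To make the penalized quantity concrete for P=2, the following sketch computes local pairwise epistatic coefficients, that is, the double-mutant-cycle contrast f(a2,b2) - f(a2,b1) - f(a1,b2) + f(a1,b1) for every pair of sites, pair of alleles, and genetic background. It is plain Python and does not call gpmap; the helper name local_epistatic_coefficients and the dict-based landscape are assumptions made for illustration only.

```python
from itertools import combinations, product


def local_epistatic_coefficients(f, alphabet, seq_length):
    """Return every local pairwise (P=2) epistatic coefficient of a complete landscape.

    f maps each sequence (a string over `alphabet`) to its phenotype.
    Hypothetical helper for illustration; not part of gpmap.
    """
    coefficients = []
    for i, j in combinations(range(seq_length), 2):
        other_sites = [p for p in range(seq_length) if p not in (i, j)]
        for a1, a2 in combinations(alphabet, 2):
            for b1, b2 in combinations(alphabet, 2):
                for background in product(alphabet, repeat=len(other_sites)):
                    # Fix the alleles at all sites other than i and j.
                    seq = [None] * seq_length
                    for p, allele in zip(other_sites, background):
                        seq[p] = allele

                    def phenotype(x, y):
                        s = list(seq)
                        s[i], s[j] = x, y
                        return f["".join(s)]

                    # Double-mutant-cycle contrast on this face of the space.
                    coefficients.append(
                        phenotype(a2, b2) - phenotype(a2, b1)
                        - phenotype(a1, b2) + phenotype(a1, b1)
                    )
    return coefficients


additive = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 3.0}
epistatic = dict(additive, BB=5.0)
print(local_epistatic_coefficients(additive, "AB", 2))   # [0.0]
print(local_epistatic_coefficients(epistatic, "AB", 2))  # [2.0]
```

A purely additive landscape has all coefficients equal to zero, which is why a prior penalizing their squared size favors the least epistatic completion of the data.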

Methods

`fit`(X, y[, y_var])	Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
`sample_prior`()	Generate a sample from the prior distribution.
`simulate`([X, y_var, p_missing, seed])	Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

fit(X, y, y_var=None)

Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.

This method infers the optimal regularization parameter a by computing the Minimum Epistasis Interpolation solution. It determines the value of a such that the expected average squared Pth epistatic coefficients match those of the MEI solution.

Parameters:

X (array-like of shape (n_obs,)): Array containing the genotypes for which observations are provided in y.
y (array-like of shape (n_obs,)): Array containing the observed phenotypes corresponding to the genotypes in X.
y_var (array-like of shape (n_obs,), optional): Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:

contrast_matrix (pd.DataFrame of shape (n_genotypes, n_contrasts)): A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:

contrasts (pd.DataFrame of shape (n_contrasts, 5))

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:

X_pred (array-like of shape (n_genotypes,), optional): Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
calc_variance (bool, optional, default=False): If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:

pred (pd.DataFrame of shape (n_genotypes, n_columns))

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

Notes

The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.
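
As a minimal sketch of the rule in the note above, the snippet below derives the extra columns from a posterior mean and variance. The function name add_credible_interval is hypothetical and the code does not call gpmap; it only mirrors the documented "mean ± 2 * standard deviation" convention.

```python
import math


def add_credible_interval(f, f_var, n_std=2.0):
    """Build one output row from a posterior mean and variance.

    Hypothetical helper; n_std=2.0 follows the note above.
    """
    f_std = math.sqrt(f_var)
    return {
        "f": f,
        "f_var": f_var,
        "f_std": f_std,
        "ci_95_lower": f - n_std * f_std,
        "ci_95_upper": f + n_std * f_std,
    }


row = add_credible_interval(f=1.5, f_var=0.25)
print(row["ci_95_lower"], row["ci_95_upper"])  # 0.5 2.5
```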

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:

f: numpy.ndarray: A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
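
The sampling scheme described above can be sketched in plain Python for a small, dense covariance matrix. This sketch uses a Cholesky factor L (with L L^T = K) as one valid matrix square root; that choice, the helper names, and the dense 2x2 example are assumptions for illustration, and the library may factor the covariance differently.

```python
import math
import random


def cholesky(K):
    """Lower-triangular L with L L^T = K, for a symmetric positive-definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def sample_gaussian_prior(K, rng):
    """Draw f = L z with z ~ N(0, I), so that f ~ N(0, K)."""
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in K]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(K))]


K = [[4.0, 2.0], [2.0, 3.0]]  # toy prior covariance over two genotypes
f = sample_gaussian_prior(K, random.Random(0))
print(len(f))  # 2
```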

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:

X (array-like, optional): Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
y_var (float or array-like, optional): Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
p_missing (float, optional): Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
seed (float, optional): Random seed for reproducibility. Default is None.

Returns:

f (array-like): The true simulated measurements without experimental noise.
X (array-like): The input sequences used for the simulation.
y (array-like): The simulated measurements with experimental noise added.
y_var (array-like): The standard deviation of the experimental noise for each input sequence.

Raises:

ValueError: If the shape of y_var does not match the expected dimensions.
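
The two sources of randomness described above (Gaussian measurement noise and random omission of genotypes) can be sketched as follows. The function simulate_observations and its noise_std argument are hypothetical and stand in for the library's internals; the noiseless landscape f is given here as a plain dict rather than drawn from a prior.

```python
import random


def simulate_observations(f, noise_std=0.0, p_missing=0.0, seed=None):
    """Return (X, y) for the retained genotypes of a noiseless landscape f.

    f maps genotype -> true phenotype. Hypothetical helper, not gpmap code.
    """
    rng = random.Random(seed)
    X, y = [], []
    for genotype, value in f.items():
        # Drop each genotype independently with probability p_missing.
        if rng.random() < p_missing:
            continue
        X.append(genotype)
        # Add zero-mean Gaussian measurement noise.
        y.append(value + rng.gauss(0.0, noise_std))
    return X, y


f_true = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 3.0}
X_obs, y_obs = simulate_observations(f_true, noise_std=0.1, p_missing=0.2, seed=42)
```

Passing the same seed reproduces both the retained genotypes and the noise draws.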

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)

class gpmap.inference.VCregression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', lambdas=None, beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=1e-05, progress=True)

Variance Component regression model for sequence-function relationships.

This model enables the inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior. The prior is parameterized by the contribution of different orders of interaction to the observed genetic variability of a continuous phenotype.

Parameters:

n_alleles (int, optional): The number of alleles per site. If not provided, it will be inferred from the data.
seq_length (int, optional): The length of the genotype sequences. If not provided, it will be inferred from the data.
genotypes (array-like, optional): A list or array of genotypes to be used in the interpolation.
alphabet_type (str, optional): The type of alphabet used for genotypes. Default is “custom”.
lambdas (array-like, optional): Variance components for each order of interaction. If not provided, they will be inferred during fitting.
beta (float, optional): The regularization parameter for the kernel alignment. Default is 0.
cross_validation (bool, optional): Whether to perform cross-validation to select the best penalization constant for regularized variance component inference. Default is False.
nfolds (int, optional): The number of folds for cross-validation. Default is 5.
cv_loss_function (str, optional): The loss function to use during cross-validation. Options are “frobenius_norm”, “logL”, or “r2”. Default is “frobenius_norm”.
num_beta (int, optional): The number of beta values to evaluate during cross-validation. Default is 20.
min_log_beta (float, optional): The minimum log10(beta) value for cross-validation. Default is -2.
max_log_beta (float, optional): The maximum log10(beta) value for cross-validation. Default is 7.
cg_rtol (float, optional): The relative tolerance for the conjugate gradient solver. Default is 1e-5.
progress (bool, optional): Whether to display progress bars during fitting. Default is True.
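
As an illustration of how num_beta, min_log_beta, and max_log_beta could define the cross-validation grid, the sketch below builds values evenly spaced in log10(beta). Even spacing and the helper name log_spaced_betas are assumptions; the documentation above only states the number of values and the log10 bounds.

```python
def log_spaced_betas(num_beta=20, min_log_beta=-2.0, max_log_beta=7.0):
    """Candidate beta values, evenly spaced on a log10 scale (assumed spacing)."""
    step = (max_log_beta - min_log_beta) / (num_beta - 1)
    return [10.0 ** (min_log_beta + i * step) for i in range(num_beta)]


betas = log_spaced_betas()
print(len(betas))  # 20
```

With the default arguments the grid runs from roughly 0.01 to 1e7.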

Methods

`fit`(X, y[, y_var, method])	Infers the Variance Components from the provided data.
`get_variance_components`([lambdas])	Return the variance components as a DataFrame computed from the lambdas.
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
`sample_prior`()	Generate a sample from the prior distribution.
`simulate`([X, y_var, p_missing, seed])	Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

fit(X, y, y_var=None, method='L-BFGS-B')

Infers the Variance Components from the provided data.

This method infers the variance components, which represent the relative contribution of different orders of interaction to the variability in the sequence-function relationships. Variance components are determined through kernel alignment with the empirical distance-covariance function.

After fitting, the optimal variance components (lambdas) are stored in the VCregression.lambdas attribute for use in predictions.
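
For intuition about the quantity being aligned, the sketch below computes an empirical distance-covariance function: the average product of mean-centered phenotypes over all pairs of observed sequences at each Hamming distance. This is a simplified illustration under that assumed definition; the helper names are hypothetical and the library's estimator may differ, for example in how it treats measurement variance.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def empirical_distance_covariance(X, y):
    """Mean product of centered phenotypes for pairs at each Hamming distance."""
    mean = sum(y) / len(y)
    centered = [v - mean for v in y]
    sums, counts = defaultdict(float), defaultdict(int)
    for i in range(len(X)):
        for j in range(i, len(X)):  # j == i gives the distance-0 (variance) term
            d = hamming(X[i], X[j])
            sums[d] += centered[i] * centered[j]
            counts[d] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


rho = empirical_distance_covariance(["AA", "AB", "BB"], [1.0, 0.0, -1.0])
print(rho[2])  # -1.0
```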

Parameters:

X (array-like of shape (n_obs,)): Array containing the genotypes for which observations are provided in y.
y (array-like of shape (n_obs,)): Array containing the observed phenotypes corresponding to the genotypes in X.
y_var (array-like of shape (n_obs,), optional): Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
method (str, optional): Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.

get_variance_components(lambdas=None)

Return the variance components as a DataFrame computed from the lambdas.

Parameters:

lambdas (array-like, optional): An array of eigenvalues representing the variance components. If not provided, the model’s current lambdas attribute will be used.

Returns:

pandas.DataFrame

A DataFrame containing the following columns:

k: Index of the variance component (ranging from 0 to seq_length).
lambdas: The input eigenvalues.
var_perc: The percentage of variance explained by each component.
var_perc_cum: The cumulative percentage of variance explained.
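
One plausible way to obtain the percentage columns is sketched below. It assumes that the variance contributed by order k is lambda_k multiplied by the number of order-k interaction terms, C(seq_length, k) * (n_alleles - 1)^k, and that the constant k=0 term contributes no variance and is therefore skipped. Both the weighting and the helper name variance_component_percentages are assumptions; they are not confirmed by the text above.

```python
from math import comb


def variance_component_percentages(lambdas, n_alleles):
    """Percent variance per interaction order k >= 1 (assumed weighting)."""
    seq_length = len(lambdas) - 1
    contributions = [
        lambdas[k] * comb(seq_length, k) * (n_alleles - 1) ** k
        for k in range(1, seq_length + 1)
    ]
    total = sum(contributions)
    rows, cumulative = [], 0.0
    for k, contribution in enumerate(contributions, start=1):
        perc = 100.0 * contribution / total
        cumulative += perc
        rows.append({"k": k, "lambdas": lambdas[k],
                     "var_perc": perc, "var_perc_cum": cumulative})
    return rows


rows = variance_component_percentages([5.0, 3.0, 2.0], n_alleles=2)
print([r["var_perc"] for r in rows])  # [75.0, 25.0]
```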

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:

contrast_matrix (pd.DataFrame of shape (n_genotypes, n_contrasts)): A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:

contrasts (pd.DataFrame of shape (n_contrasts, 5))

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:

X_pred (array-like of shape (n_genotypes,), optional): Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
calc_variance (bool, optional, default=False): If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:

pred (pd.DataFrame of shape (n_genotypes, n_columns))

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

Notes

The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:

f: numpy.ndarray: A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:

X (array-like, optional): Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
y_var (float or array-like, optional): Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
p_missing (float, optional): Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
seed (float, optional): Random seed for reproducibility. Default is None.

Returns:

f (array-like): The true simulated measurements without experimental noise.
X (array-like): The input sequences used for the simulation.
y (array-like): The simulated measurements with experimental noise added.
y_var (array-like): The standard deviation of the experimental noise for each input sequence.

Raises:

ValueError: If the shape of y_var does not match the expected dimensions.

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)

class gpmap.inference.ConnectednessModelRegression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', mu=None, cg_rtol=1e-05, progress=True)

Connectedness model regression for sequence-function relationships.

This model enables the inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior. The prior is parameterized by parameters controlling the effect of mutations at specific sites on the predictability of other mutations.

Parameters:

n_alleles (int, optional): The number of alleles per site. If not provided, it will be inferred from the data.
seq_length (int, optional): The length of the genotype sequences. If not provided, it will be inferred from the data.
genotypes (array-like, optional): A list or array of genotypes to be used in the interpolation.
alphabet_type (str, optional): The type of alphabet used for genotypes. Default is “custom”.
mu (array-like, optional): Factors controlling the site-specific decay factors. If not provided, they will be inferred during fitting.
cg_rtol (float, optional): The relative tolerance for the conjugate gradient solver. Default is 1e-5.
progress (bool, optional): Whether to display progress bars during fitting. Default is True.

Methods

`fit`(X, y[, y_var, method])	Infers the site-specific decay factors from the provided data.
`get_decay_factors`()	Return the decay factors as a DataFrame computed from the mu values.
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
`sample_prior`()	Generate a sample from the prior distribution.
`simulate`([X, y_var, p_missing, seed])	Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

fit(X, y, y_var=None, method='L-BFGS-B')

Infers the site-specific decay factors from the provided data.

This method infers the site-specific decay factors, which control the expected decrease in the predictability of other mutations in the presence of mutations at each site. Decay factors are inferred through kernel alignment with the empirical distance-covariance function.

After fitting, the optimal decay factors are used to build a Gaussian process prior for inference.

Parameters:

X (array-like of shape (n_obs,)): Array containing the genotypes for which observations are provided in y.
y (array-like of shape (n_obs,)): Array containing the observed phenotypes corresponding to the genotypes in X.
y_var (array-like of shape (n_obs,), optional): Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
method (str, optional): Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.

get_decay_factors()

Return the decay factors as a DataFrame computed from the mu values.

Returns:

pandas.DataFrame

A DataFrame containing the following columns:

p: 0-indexed position in the sequence.
mu: The mu value associated with each position.
decay_factor: Decay factor associated with each position.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:

contrast_matrix (pd.DataFrame of shape (n_genotypes, n_contrasts)): A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:

contrasts (pd.DataFrame of shape (n_contrasts, 5))

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
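
To show what a contrast summary involves numerically, the sketch below takes a weight vector c, a posterior mean m, and a posterior covariance S, and returns the estimate c^T m, the standard deviation sqrt(c^T S c), and an interval. The function summarize_contrast, the toy numbers, and the use of 2 standard deviations for the interval are assumptions for illustration; the posterior-probability column is left out because its exact definition is not given above.

```python
import math


def summarize_contrast(weights, post_mean, post_cov, n_std=2.0):
    """Posterior summary of sum_i weights[i] * f_i under a Gaussian posterior."""
    n = len(weights)
    estimate = sum(w * m for w, m in zip(weights, post_mean))
    variance = sum(
        weights[i] * post_cov[i][j] * weights[j]
        for i in range(n) for j in range(n)
    )
    std = math.sqrt(variance)
    return {
        "estimate": estimate,
        "std": std,
        "ci_95_lower": estimate - n_std * std,
        "ci_95_upper": estimate + n_std * std,
    }


# Contrast f("AAC") - f("AAA"): the effect of an A-to-C change at the last site.
summary = summarize_contrast(
    weights=[-1.0, 1.0],                 # order: ["AAA", "AAC"]
    post_mean=[1.0, 3.0],                # made-up posterior means
    post_cov=[[0.5, 0.0], [0.0, 0.5]],   # made-up posterior covariance
)
print(summary["estimate"], summary["std"])  # 2.0 1.0
```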

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:

X_pred (array-like of shape (n_genotypes,), optional): Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
calc_variance (bool, optional, default=False): If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:

pred (pd.DataFrame of shape (n_genotypes, n_columns))

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

Notes

The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:

f: numpy.ndarray: A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:

X (array-like, optional): Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
y_var (float or array-like, optional): Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
p_missing (float, optional): Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
seed (float, optional): Random seed for reproducibility. Default is None.

Returns:

f (array-like): The true simulated measurements without experimental noise.
X (array-like): The input sequences used for the simulation.
y (array-like): The simulated measurements with experimental noise added.
y_var (array-like): The standard deviation of the experimental noise for each input sequence.

Raises:

ValueError: If the shape of y_var does not match the expected dimensions.

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)

class gpmap.inference.LocalEpistasisRegression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', P=2, a_values=None, lambda_U_lower_than_P=None, cg_rtol=1e-05, progress=True)

Local epistasis regression model for sequence-function relationships.

A class for performing Local Epistasis Regression (LER) to infer complete genotype-phenotype maps from incomplete and noisy data. This model applies a prior that penalizes local epistatic coefficients of order P differently depending on the combinations of P sites and infers the posterior distribution based on experimental data for a subset of sequences.

Parameters:

n_alleles (int, optional): The number of alleles per site. If not provided, it will be inferred from the provided data.
seq_length (int, optional): The length of the genotype sequences. If not provided, it will be inferred from the provided data.
genotypes (array-like, optional): A list or array of genotypes to be used in the model. If not provided, the model will infer the genotype space.
alphabet_type (str, optional): The type of alphabet used for genotypes. Default is “custom”.
P (int, optional): The order of epistasis to consider. Default is 2. This determines the level of interaction between genetic sites that is penalized.
a_values (array-like, optional): The regularization parameters for each interaction order. If not provided, they will be inferred during the fitting process to best match the observed data.
lambda_U_lower_than_P (array-like, optional): The regularization parameters for interactions with order lower than P. If not provided, it will be inferred during fitting.
cg_rtol (float, optional): The relative tolerance for the conjugate gradient solver. Default is 1e-5. This controls the precision of the solver used in computations.
progress (bool, optional): Whether to display progress bars during fitting. Default is True.

Methods

`fit`(X, y[, y_var, method])	Fits the Local Epistasis Regression (LER) model hyperparameters to the provided data.
`get_a_values`([position_labels])	Return a DataFrame of interaction-specific regularization parameters.
`get_empirical_pred_correlations_df`()	Compute empirical and predicted correlations for pairs of sequences differing at all possible combinations of sites U.
`get_lambda_U_values`([position_labels])	Return a DataFrame of interaction-specific lambda values for interactions U with order lower than P.
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
`sample_prior`()	Generate a sample from the prior distribution.
`simulate`([X, y_var, p_missing, seed])	Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

fit(X, y, y_var=None, method='L-BFGS-B')

Fits the Local Epistasis Regression (LER) model hyperparameters to the provided data.

This method infers the optimal regularization parameters a via kernel alignment of the residuals of a P-1 order interaction model fit via maximum likelihood. Thus, we infer the a values and lambda_U that best match the empirical covariance.

Parameters:

X (array-like of shape (n_obs,)): Array containing the genotypes for which observations are provided in y.
y (array-like of shape (n_obs,)): Array containing the observed phenotypes corresponding to the genotypes in X.
y_var (array-like of shape (n_obs,), optional): Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
method (str, optional): Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.

get_a_values(position_labels=None)

Return a DataFrame of interaction-specific regularization parameters.

Parameters:

position_labels (array-like of shape (seq_length,), optional): Labels for sequence positions (ints or strings). If None, defaults to np.arange(self.seq_length).

Returns:

pandas.DataFrame: Rows correspond to interactions U. Columns include:

site{i} (for i = 0..|U|-1): Labels of the positions in U.
a_U: Regularization parameter for interaction U.
interaction_strength: 1.0 / a_U.

get_empirical_pred_correlations_df()

Compute empirical and predicted correlations for pairs of sequences differing at all possible combinations of sites U.

Returns:

pandas.DataFrame: DataFrame indexed by concatenated site labels, with columns:

d: Number of sites in the set.
n: Number of observations for that set.
emp_cor: Empirical centered autocovariance, normalized by the zero-lag value.
pred_cor: Predicted autocovariance (from the current aligner), normalized likewise.
d_jittered: Jittered d, useful for plotting.
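
A rough sketch of the empirical side of this table is given below: pairs of observed sequences are grouped by the exact set of sites U at which they differ, the mean product of centered phenotypes is computed per group, and the result is divided by the zero-lag value (the group with empty U). The helper name and the grouping details are assumptions; the library's computation may differ.

```python
from collections import defaultdict


def empirical_site_set_correlations(X, y):
    """Centered autocovariance per set of differing sites, normalized by zero lag."""
    mean = sum(y) / len(y)
    centered = [v - mean for v in y]
    sums, counts = defaultdict(float), defaultdict(int)
    for i in range(len(X)):
        for j in range(i, len(X)):
            # U: positions at which the two sequences differ (empty when i == j).
            U = tuple(p for p, (a, b) in enumerate(zip(X[i], X[j])) if a != b)
            sums[U] += centered[i] * centered[j]
            counts[U] += 1
    zero_lag = sums[()] / counts[()]
    return {
        U: {"d": len(U), "n": counts[U], "emp_cor": sums[U] / counts[U] / zero_lag}
        for U in sums
    }


cors = empirical_site_set_correlations(["AA", "AB", "BA", "BB"], [1.0, -1.0, -1.0, 1.0])
print(cors[(0, 1)]["emp_cor"])  # 1.0
```

In this toy landscape, single-site changes flip the sign of the phenotype while double changes restore it, so the one-site sets have correlation -1 and the two-site set has correlation 1.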

get_lambda_U_values(position_labels=None)

Return a DataFrame of interaction-specific lambda values for interactions U with order lower than P.

Parameters:

position_labels (array-like of shape (seq_length,), optional): Labels for sequence positions (ints or strings). If None, defaults to np.arange(self.seq_length).

Returns:

pandas.DataFrame: Rows correspond to interactions U (only those with order < P). Columns:

U: Comma-separated position labels in U.
k: Number of sites in U.
lambda_U: Regularization parameter for interaction U.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:

contrast_matrix (pd.DataFrame of shape (n_genotypes, n_contrasts)): A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:

contrasts (pd.DataFrame of shape (n_contrasts, 5))

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:

X_pred (array-like of shape (n_genotypes,), optional): Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
calc_variance (bool, optional, default=False): If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:

pred (pd.DataFrame of shape (n_genotypes, n_columns))

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

Notes

The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:

f: numpy.ndarray: A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
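The sampling recipe described above (standard normal draws transformed by a square root of the covariance) can be sketched as follows. The sketch uses a Cholesky factor as the matrix square root, which is one common choice and not necessarily what gpmap does internally; `cholesky`, `sample_gaussian`, and the toy covariance `K` are assumptions made for this example.

```python
import math
import random

def cholesky(K):
    """Lower-triangular L with L @ L.T == K for a symmetric positive-definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def sample_gaussian(K, rng):
    """Draw f = L z with z ~ N(0, I), so that f ~ N(0, K)."""
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in K]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(K))]

# Toy prior covariance over three genotypes.
K = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.5],
     [0.25, 0.5, 1.0]]
f = sample_gaussian(K, random.Random(0))
```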

simulate(X=None, y_var=0.0, p_missing=0, seed=None)

Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.

Parameters:

Xarray-like, optional: Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
y_varfloat or array-like, optional: Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
p_missingfloat, optional: Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
seedfloat, optional: Random seed for reproducibility. Default is None.

Returns:

farray-like: The true simulated measurements without experimental noise.
Xarray-like: The input sequences used for the simulation.
yarray-like: The simulated measurements with experimental noise added.
y_vararray-like: The standard deviation of the experimental noise for each input sequence.

Raises:

ValueError: If the shape of y_var does not match the expected dimensions.
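The two perturbations described above, Gaussian measurement noise and random omission of genotypes, can be sketched independently of the prior. The function `simulate_observations` and its `noise_std` argument are hypothetical names for this illustration; the true values `f_true` are simply given here instead of being drawn from the prior.

```python
import random

def simulate_observations(f, noise_std=0.0, p_missing=0.0, seed=None):
    """Add Gaussian noise to true values and drop each genotype with probability p_missing."""
    rng = random.Random(seed)
    observed = {}
    for genotype, value in f.items():
        if rng.random() < p_missing:
            continue  # genotype omitted from the simulated dataset
        observed[genotype] = value + rng.gauss(0.0, noise_std)
    return observed

f_true = {"AA": 0.0, "AB": 1.0, "BA": 1.0, "BB": 3.0}
y_clean = simulate_observations(f_true)
y_noisy = simulate_observations(f_true, noise_std=0.1, p_missing=0.5, seed=42)
```

With the default arguments nothing is dropped and no noise is added, so the output equals the true values; fixing `seed` makes the noisy draw reproducible.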

Examples

Simulate data with default parameters:

>>> f, X, y, y_var = gp.simulate()

Simulate data with custom noise and missing probability:

>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)

Density estimation

class gpmap.inference.SeqDEFT(n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, P=2, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)

Model for inference of a genotype-phenotype map from observations of sequences.

Sequence Density Estimation using Field Theory (SeqDEFT) model for inferring a complete sequence probability distribution under a Gaussian Process prior. The prior is parameterized by the variance of local epistatic coefficients of order P.

Parameters:

Pint: The order of local interaction coefficients penalized under the prior. For example, P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph, while P=3 penalizes local 3-way interactions across all possible cubes.
afloat, optional, default=None: A parameter related to the inverse variance of the P-order epistatic coefficients being penalized. Larger values induce stronger penalization, approximating the Maximum-Entropy model of order P-1. If a=None, the optimal value of a is determined through cross-validation.
num_regint, optional, default=20: The number of a values to evaluate during the cross-validation procedure.
nfoldsint, optional, default=5: The number of folds to use in the cross-validation procedure.
lambdas_P_invarray-like, optional, default=None: The inverse of the variance components for the first P orders of interaction. If provided, these values are used to regularize the kernel basis.
a_resolutionfloat, optional, default=0.1: The resolution for determining the range of a values during cross-validation.
max_a_maxfloat, optional, default=1e12: The maximum value of a to consider during cross-validation.
fac_maxfloat, optional, default=0.1: A factor to determine the maximum value of a relative to the number of P-order faces in the Hamming graph.
fac_minfloat, optional, default=1e-6: A factor to determine the minimum value of a relative to the number of P-order faces in the Hamming graph.
optimization_optsdict, optional, default={}: A dictionary of options for the optimization procedure used to calculate the maximum entropy model.
maxiterint, optional, default=10000: The maximum number of iterations for the optimization procedure.
gtolfloat, optional, default=1e-6: The gradient tolerance for the optimization procedure.
ftolfloat, optional, default=1e-8: The function tolerance for the optimization procedure.

Methods

`fit`(X[, y, baseline_phi, baseline_X, ...])	Infers the SeqDEFT model hyperparameter a from the provided data.
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
`sample_prior`()	Generate a sample from the prior distribution.
`simulate`(N[, seed])	Simulates data under the specified penalization a of local epistatic coefficients of order P.

fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)

Infers the SeqDEFT model hyperparameter a from the provided data.

This method determines the optimal regularization parameter a by a grid search under cross-validation, selecting the value of a that maximizes the log-likelihood of held-out sequences.

Parameters:

Xarray-like of shape (n_obs,): Array containing the observed sequences.
yarray-like of shape (n_obs,), optional: Array containing the weights for each observed sequence. By default, each sequence is assigned a weight of 1. These weights can be computed using phylogenetic correction.
baseline_Xarray-like of shape (n_genotypes,), optional: Array containing the sequences associated with baseline_phi.
baseline_phiarray-like of shape (n_genotypes,), optional: Array containing the baseline values (baseline_phi) to include in the model.
positionsarray-like of shape (n_pos,), optional: If provided, subsequences at these positions in the input sequences will be used as input.
phylo_correctionbool, optional, default=False: Whether to apply phylogenetic correction using the full-length sequences.
adjust_freqsbool, optional, default=False: Whether to adjust densities by the expected allele frequencies in the full-length sequences.
allele_freqsdict or codon_table, optional: Dictionary containing the expected allele frequencies for each allele in the set of possible sequences, or a codon table to generate expected amino acid frequencies. If None, these frequencies will be calculated from the full-length observed sequences.
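The general shape of a cross-validated grid search over a regularization parameter can be sketched with a much simpler stand-in model. The example below is not SeqDEFT: it fits independent per-site allele frequencies with an additive-smoothing pseudocount, and the pseudocount plays the role that a plays above. All function names are hypothetical.

```python
import math
from collections import Counter

def fit_site_freqs(seqs, alphabet, pseudocount):
    """Per-site allele frequencies with additive smoothing (stand-in model)."""
    freqs = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = len(seqs) + pseudocount * len(alphabet)
        freqs.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return freqs

def log_likelihood(seqs, freqs):
    """Log-likelihood of sequences under the independent-site model."""
    return sum(math.log(freqs[pos][a]) for s in seqs for pos, a in enumerate(s))

def cross_validate(seqs, alphabet, grid, nfolds=5):
    """Return the grid value with the highest total held-out log-likelihood."""
    scores = {}
    for value in grid:
        total = 0.0
        for fold in range(nfolds):
            test = seqs[fold::nfolds]
            train = [s for i, s in enumerate(seqs) if i % nfolds != fold]
            total += log_likelihood(test, fit_site_freqs(train, alphabet, value))
        scores[value] = total
    best = max(scores, key=scores.get)
    return best, scores

seqs = ["AA", "AA", "AB", "AA", "BA", "AA", "AB", "AA", "AA", "BB"]
best, scores = cross_validate(seqs, "AB", grid=[0.01, 0.1, 1.0, 10.0], nfolds=5)
```

The same loop structure applies to any model with one regularization parameter: fit on the training folds, score the held-out fold, and keep the parameter with the best total score.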

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.

This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.

Parameters:

contrast_matrixpd.DataFrame of shape (n_genotypes, n_contrasts): A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.

Returns:

contrastspd.DataFrame of shape (n_contrasts, 5)

A DataFrame summarizing the posterior distribution for each contrast. The columns include:

estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.

Parameters:

X_predarray-like of shape (n_genotypes,), optional: Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
calc_variancebool, optional, default=False: If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.

Returns:

predpd.DataFrame of shape (n_genotypes, n_columns)

A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:

f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.

The genotype labels are used as the row index.

If neither X_pred nor calc_variance is provided, the output DataFrame includes additional columns:

freq: Empirical frequencies of the genotypes.
Q_star: Estimated genotype probabilities.

Examples

Predict phenotypes for the entire genotype space:

>>> pred = model.predict()

Predict phenotypes for specific genotypes with variance:

>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)

sample_prior()

Generate a sample from the prior distribution.

This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.

Returns:

f: numpy.ndarray: A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.

simulate(N, seed=None)

Simulates data under the specified penalization a of local epistatic coefficients of order P.

Parameters:

Nint: Number of total sequences to sample.
seedint, optional (default=None): Random seed to use for simulation.

Returns:

phiarray-like of shape (N,): Vector containing the true phi values from which samples were generated.
Xarray-like of shape (N,): Vector containing the sampled sequences from the probability distribution.
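The step from phi values to sampled sequences can be sketched as below, assuming the usual field-theoretic convention that the sequence probability is Q(s) = exp(-phi(s)) / Z. This convention and the helper `sample_sequences` are assumptions of the example; the phi values here are a toy function, not a draw from the prior.

```python
import itertools
import math
import random

def sample_sequences(phi, N, seed=None):
    """Draw N sequences from Q(s) = exp(-phi[s]) / Z (assumed convention)."""
    rng = random.Random(seed)
    seqs = list(phi)
    weights = [math.exp(-phi[s]) for s in seqs]
    Z = sum(weights)
    Q = {s: w / Z for s, w in zip(seqs, weights)}
    X = rng.choices(seqs, weights=weights, k=N)
    return Q, X

# Toy phi: each "B" allele raises phi by 0.5, making the sequence less probable.
phi = {
    "".join(s): 0.5 * sum(a == "B" for a in s)
    for s in itertools.product("AB", repeat=3)
}
Q, X = sample_sequences(phi, N=100, seed=42)
```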

Examples

>>> from gpmap.inference import SeqDEFT
>>> model = SeqDEFT(n_alleles=4, seq_length=5, P=2, a=1.0)
>>> phi, X = model.simulate(N=100, seed=42)

Summary statistics

Experimental data

Class for computing low-level descriptors of genotype-phenotype data (observed experimental data sampled from the full sequence space).

This class extends SequenceSpaceRelatedObject and provides convenience routines to store observed data and compute covariance and local epistatic summaries using operators defined for the full sequence space. Unlike GPmapSummarizer, which operates on a complete genotype-phenotype map, GPDataSummarizer works with a (possibly sparse) dataset of genotypes and corresponding phenotypes.

Parameters:

seq_lengthint, optional: Number of sites in the sequence (sequence length). Required unless provided via genotypes.
alphabetlist of str, optional: Alphabet for each site (list of characters). Required unless provided via genotypes or inferred.
alphabet_typestr, default “custom”: Type of alphabet (keeps compatibility with SequenceSpaceRelatedObject).
genotypesnp.ndarray, optional: Array of genotype strings (one per observation). If provided, seq_length and alphabet may be inferred from these.
Xnp.ndarray, optional: Design / indicator matrix mapping observed genotypes to the full sequence-space basis. Shape (n_obs, n_full_genotypes).
ynp.ndarray, optional: Observed phenotype values corresponding to rows of X. Shape (n_obs,).
y_varnp.ndarray, optional: Measurement variances for each observation. If None, zeros are used.

Methods

`calc_covariance_U_sites`([centered])	Compute empirical auto-covariance function depending on the combination of subsets U at which pairs of genotypes differ.
`calc_covariance_distance`([centered])	Compute empirical auto-covariance function depending on the Hamming distance between pairs of genotypes.

calc_covariance_U_sites(centered: bool = False)

Compute empirical auto-covariance function depending on the combination of subsets U at which pairs of genotypes differ.

Parameters:

centeredbool, optional: If True, compute covariances using centered phenotypes (y - mean) and do not add back the phenotype mean square. If False (default), add the phenotype mean square to produce raw (uncentered) second-moment estimates.

Returns:

covnp.ndarray, shape (2 ** self.seq_length,): Covariance (or mean product) estimates for each subset U in the same order returned by self.get_Us().
nsnp.ndarray, shape (2 ** self.seq_length,): Number of observed genotype pairs that differ exactly on the sites specified by each subset U.

calc_covariance_distance(centered: bool = False)

Compute empirical auto-covariance function depending on the Hamming distance between pairs of genotypes.

Parameters:

centeredbool, optional: If True, return covariances computed on centered phenotypes (y - mean) and do not add back the phenotype mean square. If False (default), the phenotype mean square is added to produce raw (uncentered) second-moment estimates.

Returns:

covnp.ndarray, shape (seq_length + 1,): Covariance (or mean product) estimates for each Hamming distance d = 0..seq_length. Note: cov[0] is adjusted to remove the mean measurement variance (self.y_var_mean).
nsnp.ndarray, shape (seq_length + 1,): Number of pairs of sequences at each distance class.
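A brute-force version of this distance-dependent covariance can be sketched by looping over pairs of observed genotypes. This is only an illustration: it counts each unordered pair once, pairs each genotype with itself at distance 0, and omits the measurement-variance correction of cov[0]. These conventions and the function names are assumptions, not gpmap's implementation.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def covariance_by_distance(genotypes, y, centered=False):
    """Average product of phenotypes for genotype pairs at each Hamming distance."""
    seq_length = len(genotypes[0])
    mean = sum(y) / len(y) if centered else 0.0
    sums = [0.0] * (seq_length + 1)
    ns = [0] * (seq_length + 1)
    for value in y:  # distance 0: each genotype paired with itself
        sums[0] += (value - mean) ** 2
        ns[0] += 1
    for i, j in combinations(range(len(genotypes)), 2):
        d = hamming(genotypes[i], genotypes[j])
        sums[d] += (y[i] - mean) * (y[j] - mean)
        ns[d] += 1
    cov = [s / n if n else float("nan") for s, n in zip(sums, ns)]
    return cov, ns

genotypes = ["AA", "AB", "BA", "BB"]
y = [0.0, 1.0, 1.0, 2.0]
cov, ns = covariance_by_distance(genotypes, y, centered=True)
```

For this additive toy landscape the centered covariance decreases with distance, from 0.5 at distance 0 to -0.5 at distance 2.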

Complete landscapes

class gpmap.summary.GPmapSummarizer(n_alleles: int, seq_length: int, f=None)

Class for computing low-level descriptors of a complete genotype-phenotype map.

Parameters:

n_allelesint: Number of alleles per site.
seq_lengthint: Number of sites in the sequence (sequence length).
farray-like, optional: Phenotype values for every possible genotype, ordered lexicographically. If None, the phenotype vector can be provided later when calling instance methods.

Methods

`calc_V_U_variance_components`([f])	Compute variance components contributed by interactions between every possible subset of sites U.
`calc_V_k_variance_components`([f])	Compute variance components contributed by interactions of each order k.
`calc_root_mean_squared_epistatic_coeff`([P, f])	Compute root mean squared epistatic coefficient of order P across all possible combinations of P mutations in the complete genotype-phenotype map.
`calc_site_pairs_variance_perc`(V_U_vcs[, min_k])	Compute the percentage variance explained by genetic interactions of at least order `min_k` involving every possible pair of sites from previously computed V_U variance components.
`calc_sites_variance_perc`(V_U_vcs)	Compute the percentage variance explained by genetic interactions of every possible order involving every possible site from previously computed V_U variance components.

calc_V_U_variance_components(f=None)

Compute variance components contributed by interactions between every possible subset of sites U.

Calculates the total variance in the phenotype vector f explained by genetic interactions involving all subsets of sites U. For each U this method projects f onto the corresponding subspace using VUProjectionOperator and computes its norm.

Parameters:

farray-like, optional: Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.

Returns:

V_U_vcspd.DataFrame

DataFrame with shape (seq_length, 5) and columns:

U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by interactions among the sites in U
variance_perc: percentage of total variance explained by U
variance_perc_cum: cumulative percentage up to and including k

Notes

Percentages are scaled so that the sum of variance_perc is 100.

calc_V_k_variance_components(f=None)

Compute variance components contributed by interactions of each order k.

Calculates the total variance in the phenotype vector f explained by genetic interactions of order k for k = 1..seq_length. For each k this method projects f onto the corresponding subspace using ProjectionOperator and computes its norm.

Parameters:

farray-like, optional: Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.

Returns:

V_k_vcspd.DataFrame

DataFrame with shape (seq_length, 4) and columns:

k: interaction order (1..seq_length)
variance: total variance explained by order k
variance_perc: percentage of total variance explained by k
variance_perc_cum: cumulative percentage up to and including k

Notes

Percentages are scaled so that the sum of variance_perc is 100.
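For the special case of two alleles per site, the decomposition by interaction order can be sketched with a Walsh-Hadamard expansion: each subset of sites U has one coefficient, and the variance of order k is the sum of squared coefficients over subsets of size k. This biallelic shortcut is an assumption made for illustration and does not use gpmap's ProjectionOperator; `variance_components_by_order` is a hypothetical name.

```python
import itertools

def variance_components_by_order(f, seq_length):
    """Variance by interaction order for a complete biallelic landscape.

    f: list of 2**seq_length phenotypes, genotypes in lexicographic order over alleles 0/1.
    """
    genotypes = list(itertools.product((0, 1), repeat=seq_length))
    n = len(genotypes)
    variance = [0.0] * (seq_length + 1)
    for U in genotypes:  # each 0/1 tuple also labels a subset of sites
        coeff = sum(
            f[i] * (-1) ** sum(u & x for u, x in zip(U, g))
            for i, g in enumerate(genotypes)
        ) / n
        variance[sum(U)] += coeff ** 2
    total = sum(variance[1:])  # order 0 is the squared mean, not variance
    rows, cum = [], 0.0
    for k in range(1, seq_length + 1):
        perc = 100 * variance[k] / total
        cum += perc
        rows.append({"k": k, "variance": variance[k],
                     "variance_perc": perc, "variance_perc_cum": cum})
    return rows

# Phenotypes for genotypes 00, 01, 10, 11: additive effects plus one interaction.
rows = variance_components_by_order([0.0, 1.0, 1.0, 3.0], seq_length=2)
```

The components for k >= 1 add up to the population variance of f, so the percentages sum to 100. A constant landscape has zero total variance and would need separate handling.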

calc_root_mean_squared_epistatic_coeff(P=2, f=None)

Compute root mean squared epistatic coefficient of order P across all possible combinations of P mutations in the complete genotype-phenotype map.

Parameters:

Pint, optional, default=2: The order of local epistatic coefficients to compute. For example, P=1 corresponds to mutational effects and P=2 to pairwise epistatic coefficients.
farray-like, optional: Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.

Returns:

rmsecfloat: Root mean squared epistatic coefficient of order P
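For P=2, a brute-force version of this statistic can be sketched by enumerating every face of the Hamming graph (a pair of sites, a pair of alleles at each site, and a background at the remaining sites) and averaging the squared local coefficient f(a1b1) - f(a2b1) - f(a1b2) + f(a2b2). The function name and the absence of any extra normalization factor are assumptions of this sketch.

```python
import itertools
import math

def rms_pairwise_epistasis(f, alphabet, seq_length):
    """Root mean squared local pairwise epistatic coefficient over all faces.

    f: dict mapping every sequence in the complete space to its phenotype.
    """
    squares = []
    allele_pairs = list(itertools.combinations(alphabet, 2))
    for i, j in itertools.combinations(range(seq_length), 2):
        other_sites = [p for p in range(seq_length) if p not in (i, j)]
        for background in itertools.product(alphabet, repeat=len(other_sites)):
            seq = [None] * seq_length
            for p, allele in zip(other_sites, background):
                seq[p] = allele

            def value(x, y):
                # Phenotype of the background with alleles x, y at sites i, j.
                s = list(seq)
                s[i], s[j] = x, y
                return f["".join(s)]

            for (a1, a2), (b1, b2) in itertools.product(allele_pairs, repeat=2):
                eps = value(a1, b1) - value(a2, b1) - value(a1, b2) + value(a2, b2)
                squares.append(eps ** 2)
    return math.sqrt(sum(squares) / len(squares))

f = {"AA": 0.0, "AB": 1.0, "BA": 1.0, "BB": 3.0}
rmsec = rms_pairwise_epistasis(f, "AB", seq_length=2)
```

This two-site landscape has a single face with coefficient 1, so the result is 1.0; a purely additive landscape gives 0.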

calc_site_pairs_variance_perc(V_U_vcs, min_k=2)

Compute the percentage variance explained by genetic interactions of at least order min_k involving every possible pair of sites from previously computed V_U variance components.

Parameters:

V_U_vcspd.DataFrame

DataFrame with shape (seq_length, 5) and columns:

U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by interactions among the sites in U
variance_perc: percentage of total variance explained by U
variance_perc_cum: cumulative percentage up to and including k

This DataFrame is the output of calc_V_U_variance_components.

min_kint, optional

Minimum interaction order to include. Defaults to 2. Must satisfy 1 <= min_k <= self.seq_length.

Returns:

vcs_percpd.DataFrame: Table with columns site1, site2, variance, and variance_perc that reports the percentage variance contributed by interactions of order >= min_k for each site pair.

Raises:

ValueError: If min_k is outside the range 1..self.seq_length or if V_U_vcs references unexpected sites.

Notes

Percentages are scaled so that the sum of variance_perc is 100.
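One plausible way to aggregate subset-level variance components into site-pair percentages is sketched below: for each pair of sites, sum the variance of every subset U of size at least min_k that contains both sites, then rescale so the percentages sum to 100. The input format (a dict keyed by tuples of site indices) and the function name are hypothetical, and the aggregation rule is an assumption rather than a description of gpmap's code.

```python
from itertools import combinations

def site_pairs_variance_perc(V_U, seq_length, min_k=2):
    """Variance from interactions of order >= min_k involving each pair of sites.

    V_U: dict mapping a tuple of site indices U to its variance component.
    """
    if not 1 <= min_k <= seq_length:
        raise ValueError("min_k must be in 1..seq_length")
    rows = []
    for s1, s2 in combinations(range(seq_length), 2):
        variance = sum(
            v for U, v in V_U.items()
            if len(U) >= min_k and s1 in U and s2 in U
        )
        rows.append({"site1": s1, "site2": s2, "variance": variance})
    total = sum(r["variance"] for r in rows)
    for r in rows:
        r["variance_perc"] = 100 * r["variance"] / total if total else 0.0
    return rows

V_U = {(0,): 1.0, (1,): 1.0, (2,): 1.0,
       (0, 1): 0.3, (0, 2): 0.1, (1, 2): 0.0,
       (0, 1, 2): 0.1}
pairs = site_pairs_variance_perc(V_U, seq_length=3, min_k=2)
```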

calc_sites_variance_perc(V_U_vcs)

Compute the percentage variance explained by genetic interactions of every possible order involving every possible site from previously computed V_U variance components.

Parameters:

V_U_vcspd.DataFrame

DataFrame with shape (seq_length, 5) and columns:

U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by interactions among the sites in U
variance_perc: percentage of total variance explained by U
variance_perc_cum: cumulative percentage up to and including k

This DataFrame is the output of calc_V_U_variance_components.

Returns:

vcs_percpd.DataFrame of shape (seq_length, seq_length): Table where the rows index interaction order (1..seq_length) and the columns index each site position. Each entry reports the percentage of the total variance explained by components of order k that involve site p.

Raises:

ValueError: If V_U_vcs references sites outside self.positions.

Notes

Percentages are scaled so that the sum of variance_perc is 100.

Visualization

Discrete Spaces

class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)

Class to define an arbitrary discrete space characterized by the connectivity between different states and optionally by a scalar value (e.g. fitness or energy) associated with each state.

Parameters:

adjacency_matrixscipy.sparse.csr_matrix of shape (n_states, n_states): Sparse matrix representing the adjacency relationships between states. The (i, j) entry contains a 1 if states i and j are connected, and 0 otherwise.
yarray-like of shape (n_states,), optional: Function value associated with each state.
state_labelsarray-like of shape (n_states,), optional: Labels for the states in the discrete space.

Attributes:

n_statesint: Number of states in the discrete space.
state_labelsarray-like of shape (n_states,): Labels for the states in the discrete space.
state_idxspd.Series of shape (n_states,): A pandas Series mapping state labels to their corresponding indices. The index of the Series is state_labels, allowing quick lookup of indices for a given set of state labels.
is_regularbool: Whether the space is regular, that is, whether every state has the same number of neighbors.

Methods

`get_edges_df`()	Generate a DataFrame representing the edges of the adjacency graph.
`get_neighbor_pairs`()	Retrieve pairs of indices representing connected states in the DiscreteSpace.
`get_neighbors`(states[, max_distance])	Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
`get_state_idxs`(states)	Get the indexes for the provided state labels.

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.

Returns:

edges_dfpd.DataFrame: A DataFrame with two columns:

i: The source node of the edge.
j: The target node of the edge.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

Returns:

tuple of np.ndarray: Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.

get_neighbors(states, max_distance=1)

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

Parameters:

statesarray-like of shape (state_number,): A list or numpy array of state labels from which to find neighbors.
max_distanceint, optional, default=1: The maximum distance within which neighbors of the provided states will be included.

Returns:

neighbor_statesnp.array: An array containing the state labels of all unique neighbors within the specified distance from the input states.
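Finding all states within a maximum distance amounts to a breadth-first search that stops expanding at depth max_distance. The sketch below works on a plain dict-of-sets adjacency instead of the sparse matrix used by the class, and it includes the input states themselves in the result; both choices, and the name `neighbors_within`, are assumptions of the example.

```python
from collections import deque

def neighbors_within(adjacency, states, max_distance=1):
    """Unique states reachable from any input state in at most max_distance steps.

    adjacency: dict mapping each state label to the set of adjacent labels.
    """
    distance = {s: 0 for s in states}
    queue = deque(states)
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # do not expand beyond the distance limit
        for nxt in adjacency[current]:
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return sorted(distance)

# Line graph a - b - c - d.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
one_step = neighbors_within(adjacency, ["a"], max_distance=1)
two_steps = neighbors_within(adjacency, ["a"], max_distance=2)
```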

get_state_idxs(states)

Get the indexes for the provided state labels.

Parameters:

statesarray-like: A list or array of state labels for which the indexes are to be retrieved.

Returns:

pandas.Series: A pandas Series containing the indexes corresponding to the provided state labels.

class gpmap.space.GridSpace(length, y=None, ndim=2)

N-dimensional grid discrete space.

A discrete space formed by the Cartesian product of one-dimensional spaces of ordered n-states, represented by a line graph.

Parameters:

length: int or array-like: The number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If an array-like of lengths is provided, they will be used to form a grid with the specified dimensions, and the ndim argument will be ignored.
ndim: int: The number of dimensions in the grid when a single length value is provided.
y: array-like of shape (length ** ndim,) or None: Phenotypic values associated with each possible state.
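In a Cartesian product of line graphs, two grid states are adjacent when they differ by one step along exactly one dimension. The sketch below enumerates the states and undirected edges of such a grid with the standard library only; `grid_edges` is a hypothetical helper that mirrors the `length` and `ndim` arguments described above, not the class constructor.

```python
import itertools

def grid_edges(length, ndim=2):
    """States and undirected edges of an n-dimensional grid graph."""
    # A single integer is repeated across ndim dimensions; a sequence overrides ndim.
    lengths = [length] * ndim if isinstance(length, int) else list(length)
    states = list(itertools.product(*(range(n) for n in lengths)))
    edges = []
    for state in states:
        for axis, n in enumerate(lengths):
            if state[axis] + 1 < n:
                nxt = state[:axis] + (state[axis] + 1,) + state[axis + 1:]
                edges.append((state, nxt))
    return states, edges

states, edges = grid_edges(3, ndim=2)
```

A 3 x 3 grid has 9 states and 12 edges, 6 along each dimension.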

Methods

`get_edges_df`()	Generate a DataFrame representing the edges of the adjacency graph.
`get_neighbor_pairs`()	Retrieve pairs of indices representing connected states in the DiscreteSpace.
`get_neighbors`(states[, max_distance])	Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
`get_state_idxs`(states)	Get the indexes for the provided state labels.
`set_peaks`(positions[, sigma])	Set peaks in the grid space by assigning function values based on distances from specified positions.

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.

Returns:

edges_dfpd.DataFrame: A DataFrame with two columns:

i: The source node of the edge.
j: The target node of the edge.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

Returns:

tuple of np.ndarray: Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.

get_neighbors(states, max_distance=1)

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

Parameters:

statesarray-like of shape (state_number,): A list or numpy array of state labels from which to find neighbors.
max_distanceint, optional, default=1: The maximum distance within which neighbors of the provided states will be included.

Returns:

neighbor_statesnp.array: An array containing the state labels of all unique neighbors within the specified distance from the input states.

get_state_idxs(states)

Get the indexes for the provided state labels.

Parameters:

statesarray-like: A list or array of state labels for which the indexes are to be retrieved.

Returns:

pandas.Series: A pandas Series containing the indexes corresponding to the provided state labels.

set_peaks(positions, sigma=1)

Set peaks in the grid space by assigning function values based on distances from specified positions.

Parameters:

positionsarray-like of shape (n_peaks, ndim): Coordinates of the peaks in the grid space. Each row represents the position of a peak in the n-dimensional space.
sigmafloat, optional, default=1: Controls the spread of the peaks. Smaller values result in sharper peaks, while larger values create broader peaks.

class gpmap.space.CodonSpace(allowed_aminoacids, codon_table='Standard', add_variation=False, seed=None)

Generate a 3-nucleotide sequence space based on allowed amino acids.

This class creates a nucleotide sequence space corresponding to the provided amino acid constraints using a codon table. Optionally, random variation can be added to the nucleotide space.

Parameters:

allowed_aminoacidsstr or array-like: A single amino acid (as a string) or a list/array of allowed amino acids.
codon_tablestr, optional: The codon table to use for mapping amino acids to nucleotides. Default is “Standard”.
add_variationbool, optional: If True, adds random variation to the nucleotide space. Default is False.
seedint, optional: Seed for the random number generator, used when add_variation is True. Default is None.

Methods

`get_edges_df`()	Generate a DataFrame representing the edges of the adjacency graph.
`get_neighbor_pairs`()	Retrieve pairs of indices representing connected states in the DiscreteSpace.
`get_neighbors`(states[, max_distance])	Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
`get_state_idxs`(states)	Get the indexes for the provided state labels.

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.

Returns:

edges_dfpd.DataFrame: A DataFrame with two columns:

i: The source node of the edge.
j: The target node of the edge.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

Returns:

tuple of np.ndarray: Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.

get_neighbors(states, max_distance=1)

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

Parameters:

statesarray-like of shape (state_number,): A list or numpy array of state labels from which to find neighbors.
max_distanceint, optional, default=1: The maximum distance within which neighbors of the provided states will be included.

Returns:

neighbor_statesnp.array: An array containing the state labels of all unique neighbors within the specified distance from the input states.

get_state_idxs(states)

Get the indexes for the provided state labels.

Parameters:

statesarray-like: A list or array of state labels for which the indexes are to be retrieved.

Returns:

pandas.Series: A pandas Series containing the indexes corresponding to the provided state labels.

class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)

Space of all possible sequences of certain length.

Class for creating a sequence space, a discrete space whose states are sequences. Two states are connected if they differ at a single position in the sequence. It can be created in two different ways:

From a set of sequences and function values (X, y).
By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site, and type of alphabet).

Parameters:

Xarray-like of shape (n_genotypes,), optional: Sequences to use as state labels of the discrete sequence space.
yarray-like of shape (n_genotypes,), optional: Quantitative phenotype or fitness associated with each genotype.
seq_lengthint, optional: Length of the sequences in the sequence space. If not provided, it will be inferred from alphabet or n_alleles.
n_alleleslist of int, optional: List containing the number of alleles present at each site in the sequence space. This can only be specified for alphabet_type=’custom’.
alphabet_typestr, default=’dna’: Type of sequence. Options are {‘dna’, ‘rna’, ‘protein’, ‘custom’}.
alphabetlist of lists, optional: A list where each element is itself a list containing the different alleles allowed at each site. The number and type of alleles can vary across sites.
stop_yfloat, optional: Value of the function assigned to protein sequences with an in-frame stop codon. If provided, the protein alphabet will be extended to include * for stop codons.

Attributes:

n_genotypesint: Number of states in the complete sequence space.
genotypesarray-like of shape (n_genotypes,): Genotype labels in the sequence space.
adjacency_matrixscipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes): Sparse matrix representing the adjacency relationships between genotypes. The (i, j) entry contains a 1 if genotypes i and j differ by a single mutation, and 0 otherwise.
yarray-like of shape (n_genotypes,), optional: Quantitative phenotype or fitness associated with each genotype.
is_regularbool: Whether the space is regular, that is, whether every state has the same number of neighbors.

Methods

`get_edges_df`()	Generate a DataFrame representing the edges of the adjacency graph.
`get_neighbor_pairs`()	Retrieve pairs of indices representing connected states in the DiscreteSpace.
`get_neighbors`(states[, max_distance])	Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
`get_single_mutant_matrix`(sequence[, center])	Calculate the effects of single point mutations from a focal sequence.
`get_state_idxs`(states)	Get the indexes for the provided state labels.
`remove_codon_incompatible_transitions`([...])	Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.
`to_nucleotide_space`([codon_table, alphabet_type])	Convert a protein sequence space into a nucleotide sequence space.

get_edges_df()

Generate a DataFrame representing the edges of the adjacency graph.

This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.

Returns:

edges_dfpd.DataFrame: A DataFrame with two columns:

i: The source node of the edge.
j: The target node of the edge.

get_neighbor_pairs()

Retrieve pairs of indices representing connected states in the DiscreteSpace.

Returns:

tuple of np.ndarray: Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
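
As a rough illustration of what these index pairs represent, the sketch below enumerates a small sequence space in plain Python and lists the pairs of states that differ by a single mutation. It does not call gpmap: the helper `neighbor_pairs`, the lexicographic ordering of genotypes, and the choice to report each connection once per direction are all assumptions made for the example.

```python
from itertools import product

def neighbor_pairs(alphabet):
    """Return (genotypes, sources, targets) for a toy sequence space.

    `alphabet` is a list of per-site allele lists. Two states are
    connected when they differ at exactly one site.
    """
    genotypes = ["".join(g) for g in product(*alphabet)]
    sources, targets = [], []
    for i, gi in enumerate(genotypes):
        for j, gj in enumerate(genotypes):
            n_diff = sum(a != b for a, b in zip(gi, gj))
            if n_diff == 1:
                sources.append(i)
                targets.append(j)
    return genotypes, sources, targets

# Two sites with two alleles each: four states in total.
genotypes, sources, targets = neighbor_pairs([["A", "B"], ["A", "B"]])
print(genotypes)
print(list(zip(sources, targets)))
```

The two returned lists play the role of the source and target index arrays described above.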

get_neighbors(states, max_distance=1)

Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.

Parameters:

statesarray-like of shape (state_number,): A list or numpy array of state labels from which to find neighbors.
max_distanceint, optional, default=1: The maximum distance within which neighbors of the provided states will be included.

Returns:

neighbor_statesnp.array: An array containing the state labels of all unique neighbors within the specified distance from the input states.
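
The role of max_distance can be sketched with a Hamming-distance filter over a small space. This is an independent illustration rather than the library's implementation; the helper `neighbors_within` is hypothetical, and the text above does not state whether the input states themselves are returned (this sketch includes them).

```python
from itertools import product

def neighbors_within(states, alphabet, max_distance=1):
    """Labels of all states within `max_distance` mutations of any input state."""
    space = ["".join(g) for g in product(*alphabet)]
    found = set()
    for label in space:
        for state in states:
            hamming = sum(a != b for a, b in zip(label, state))
            if hamming <= max_distance:
                found.add(label)
                break  # one close input state is enough
    return sorted(found)

alphabet = [["A", "C"], ["A", "C"], ["A", "C"]]
print(neighbors_within(["AAA"], alphabet, max_distance=1))
print(neighbors_within(["AAA"], alphabet, max_distance=2))
```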

get_single_mutant_matrix(sequence, center=False)

Calculate the effects of single point mutations from a focal sequence.

Parameters:

sequencestr: The sequence from which to compute all single point mutant effects.
centerbool, optional, default=False: If True, the results will be centered by position, ensuring that the mean of allelic effects at each position is 0. If False, the focal sequence will have a value of 0, and the results will represent mutational effects relative to it.

Returns:

outputpd.DataFrame of shape (seq_length, total_alleles): A DataFrame containing the mutational or allelic effects for each allele across all sequence positions.
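
The difference between the two settings of center can be seen on a toy landscape. The sketch below is a plain-Python illustration under the description above, not the library code; the function `single_mutant_matrix` and the dictionary-based landscape are assumptions of the example.

```python
def single_mutant_matrix(f, sequence, alleles, center=False):
    """Rows are positions, columns are alleles; `f` maps sequence -> phenotype."""
    matrix = []
    for pos in range(len(sequence)):
        row = []
        for allele in alleles:
            mutant = sequence[:pos] + allele + sequence[pos + 1:]
            # Effect of the mutation relative to the focal sequence.
            row.append(f[mutant] - f[sequence])
        if center:
            # Subtract the row mean so allelic effects average to 0 per position.
            mean = sum(row) / len(row)
            row = [value - mean for value in row]
        matrix.append(row)
    return matrix

# Toy additive landscape over two sites with alleles A and B.
f = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 3.0}
relative = single_mutant_matrix(f, "AA", "AB")
centered = single_mutant_matrix(f, "AA", "AB", center=True)
print(relative)
print(centered)
```

With center=False the column of the focal allele is 0 at every position; with center=True each row sums to 0 instead.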

get_state_idxs(states)

Get the indexes for the provided state labels.

Parameters:

statesarray-like: A list or array of state labels for which the indexes are to be retrieved.

Returns:

pandas.Series: A pandas Series containing the indexes corresponding to the provided state labels.

remove_codon_incompatible_transitions(codon_table='Standard')

Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.

This method updates the adjacency matrix of the sequence space to ensure that transitions between states are compatible with the specified codon table. Only transitions that result in valid amino acid substitutions according to the codon table will be allowed.

Parameters:

codon_tablestr or Bio.Data.CodonTable: The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.
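
One reading of "codon-compatible" is that two amino acids remain connected only if some codon for one can be turned into a codon for the other by a single nucleotide substitution. The sketch below checks that condition against the standard genetic code in plain Python. It is an illustration of the idea under that assumption, not the method's implementation, and `codon_compatible` is a hypothetical helper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_compatible(aa1, aa2):
    """True if a codon for aa1 is one nucleotide change away from a codon for aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return any(
        sum(x != y for x, y in zip(c1, c2)) == 1
        for c1 in codons1
        for c2 in codons2
    )

print(codon_compatible("M", "L"))  # ATG -> TTG or CTG is a single change
print(codon_compatible("M", "W"))  # ATG -> TGG needs two changes
```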

to_nucleotide_space(codon_table='Standard', alphabet_type='dna')

Convert a protein sequence space into a nucleotide sequence space.

This method transforms a protein sequence space into a nucleotide sequence space using a specified codon table for translation. The resulting nucleotide space will have 4 alleles per site and 3 times the number of sites as the original protein space. It assumes that the function associated with each nucleotide sequence depends only on the protein sequence it encodes.

Parameters:

codon_tablestr or Bio.Data.CodonTable: The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.
alphabet_typestr, optional, default=’dna’: The type of nucleotide sequence to use in the resulting space. Must be one of {‘dna’, ‘rna’}.

Returns:

SequenceSpace: A nucleotide sequence space with the specified properties.
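
The assumption that the function of a nucleotide sequence depends only on the protein it encodes can be sketched as a lookup through the codon table. The example below is plain Python and does not use gpmap; `protein_to_nucleotide_function` is a hypothetical helper, and assigning `stop_y` to sequences with an in-frame stop codon follows the description of the stop_y parameter rather than any inspected code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def protein_to_nucleotide_function(protein_f, n_codons, stop_y=None):
    """Map every DNA sequence of length 3 * n_codons to the value of its translation."""
    nucleotide_f = {}
    for seq in product(BASES, repeat=3 * n_codons):
        dna = "".join(seq)
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
        # Sequences with an in-frame stop codon receive `stop_y`.
        nucleotide_f[dna] = stop_y if "*" in protein else protein_f[protein]
    return nucleotide_f

# Toy protein landscape of length 1: one arbitrary value per amino acid.
protein_f = {aa: float(i) for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
nuc_f = protein_to_nucleotide_function(protein_f, n_codons=1, stop_y=-1.0)
print(len(nuc_f), nuc_f["ATG"], nuc_f["TAA"])
```

A protein space with one site becomes a nucleotide space with three sites and four alleles per site, and synonymous codons share the same value.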

Random walks

class gpmap.randwalk.WMWalk(space, log=None, Ns=None)

Class for Weak Mutation Random Walk on a SequenceSpace. This is a time-reversible continuous-time Markov Chain where the transition rates are determined by the differences in fitness between two states, scaled by the effective population size Ns.

The transition rate matrix Q(i, j) is defined as:

\[\begin{split}Q(i, j) = \begin{cases} M(i, j)\frac{S(i, j)}{1 - e^{-S(i, j)}} & \text{if $i$ and $j$ are neighbors}\\ -\sum_{k\neq i} Q(i, k) & \text{if } i=j \\ 0 & \text{otherwise}, \end{cases}\end{split}\]

where $M(i, j)$ is the time-reversible neutral mutation rate between $i$ and $j$ and $S(i, j)$ is the scaled fitness difference between $i$ and $j$, typically defined as $S(i, j) = Ns(f_j - f_i)$, where $f_i$ is the phenotype for state $i$.
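
A minimal numerical sketch of such a rate matrix is given below for a four-state space. It uses the standard weak-mutation fixation factor $S / (1 - e^{-S})$ with $S(i, j) = Ns(f_j - f_i)$, and assumes a uniform neutral mutation rate and a dense list-of-lists matrix. The function names are hypothetical and the code is independent of the class described here.

```python
import math

def fixation_factor(S):
    """S / (1 - exp(-S)), with the neutral limit 1 at S = 0."""
    return 1.0 if S == 0 else S / -math.expm1(-S)

def rate_matrix(f, neighbors, Ns, mu=1.0):
    """Dense rate matrix for states 0..n-1 with phenotypes `f`.

    `neighbors[i]` lists the states adjacent to i; `mu` is a uniform
    neutral mutation rate M(i, j), an assumption of this sketch.
    """
    n = len(f)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            Q[i][j] = mu * fixation_factor(Ns * (f[j] - f[i]))
        Q[i][i] = -sum(Q[i])  # diagonal makes each row sum to zero
    return Q

# Two biallelic sites: states 00, 01, 10, 11.
f = [0.0, 1.0, 1.0, 3.0]
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
Q = rate_matrix(f, neighbors, Ns=2.0)

# With uniform neutral dynamics, detailed balance holds for pi_i ~ exp(Ns * f_i).
pi = [math.exp(2.0 * value) for value in f]
print(pi[0] * Q[0][1], pi[1] * Q[1][0])
```

The two printed fluxes agree, which is the time-reversibility property mentioned above.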

Methods

`calc_rate_matrix`([Ns, neutral_stat_freqs, ...])	Computes and stores the rate matrix for the random walk in the discrete space.
`calc_stationary_frequencies`([Ns, ...])	Calculates the stationary frequencies of states under the given evolutionary model.
`calc_visualization`([Ns, mean_function, ...])	Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk.
`set_Ns`([Ns, mean_function, ...])	Sets the scaled effective population size (Ns) or calculates it based on the desired mean function or percentile of the function values.
`write_tables`(prefix[, write_edges, ...])	Write the output of the visualization to files with a common prefix.

calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)

Computes and stores the rate matrix for the random walk in the discrete space.

Parameters:

Nsfloat, optional: Scaled effective population size for the evolutionary model. If not provided, the value of self.Ns will be used.
neutral_stat_freqsarray-like of shape (n_states,), optional: Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.
neutral_exchange_ratesscipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional: Sparse matrix containing the neutral exchange rates for the entire sequence space. If not provided, uniform mutational dynamics are assumed.

Notes

The resulting rate matrix is stored in the rate_matrix attribute.
The method also calculates the symmetrized rate matrix as an intermediate step.

calc_stationary_frequencies(Ns=None, neutral_stat_freqs=None)

Calculates the stationary frequencies of states under the given evolutionary model.

Parameters:

Nsfloat, optional: Scaled effective population size for the evolutionary model. If not provided, the value of self.Ns will be used.
neutral_stat_freqsarray-like of shape (n_states,), optional: Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.

Returns:

stationary_freqsarray-like of shape (n_states,): The stationary frequencies of states under the given evolutionary model.

calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, each re-scaled by the inverse square root of its decay rate (the negative of the corresponding eigenvalue), so that the embedding is in units of square root of time.

Parameters:

Nsfloat, optional: Scaled effective population size to use in the underlying evolutionary model. If not provided, it will be derived from mean_function or mean_function_perc.
mean_functionfloat, optional: Mean function at stationarity to derive the associated Ns. Either this or mean_function_perc must be provided if Ns is not specified.
mean_function_percfloat, optional: Percentile that the mean function at stationarity takes within the distribution of function values along sequence space. For example, if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values. Either this or mean_function must be provided if Ns is not specified.
n_componentsint, default=10: Number of eigenvectors, or diffusion axes, to calculate.
neutral_stat_freqsarray-like of shape (n_states,), optional: Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, uniform stationary frequencies are assumed.
neutral_exchange_ratesscipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional: Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
tolfloat, default=1e-12: Tolerance for the eigendecomposition solver. Lower values result in higher precision but may increase computation time.

Notes

The visualization coordinates are stored in self.nodes_df, which includes the scaled eigenvectors, function values, and stationary frequencies for each state.
Relaxation times and decay rates are stored in self.decay_rates_df.

set_Ns(Ns=None, mean_function=None, mean_function_perc=None, neutral_stat_freqs=None, tol=0.0001)

Sets the scaled effective population size (Ns) or calculates it based on the desired mean function or percentile of the function values.

Parameters:

Nsfloat, optional: Scaled effective population size for the evolutionary model. If provided, it will be directly set. Must be non-negative.
mean_functionfloat, optional: Desired mean function value at stationarity. If provided, Ns will be optimized to achieve this value.
mean_function_percfloat, optional: Percentile of the function values to use as the desired mean function at stationarity. For example, if set to 98, the mean function will be set to the 98th percentile of the function values.
neutral_stat_freqsarray-like of shape (n_states,), optional: Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.
tolfloat, optional, default=1e-4: Tolerance for determining whether the mean function is close to the neutral mean function.

Raises:

ValueError: If none of Ns, mean_function, or mean_function_perc is provided.
ValueError: If mean_function_perc is not between 0 and 100.
ValueError: If mean_function is not between the neutral mean function and the maximum function value.
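
The relationship between Ns and the mean function at stationarity can be sketched as a one-dimensional root search. The example assumes uniform neutral stationary frequencies, so that the stationary frequency of state $i$ is proportional to $e^{Ns f_i}$, and a linearly interpolated percentile. The helper names and the bisection strategy are choices made for this illustration, not a description of how set_Ns is implemented.

```python
import math

def mean_function(f, Ns):
    """Mean phenotype at stationarity, assuming pi_i proportional to exp(Ns * f_i)."""
    fmax = max(f)
    weights = [math.exp(Ns * (v - fmax)) for v in f]  # shift by fmax for stability
    return sum(w * v for w, v in zip(weights, f)) / sum(weights)

def percentile(values, perc):
    """Linearly interpolated percentile of a list of values."""
    ordered = sorted(values)
    rank = perc / 100 * (len(ordered) - 1)
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])

def solve_Ns(f, target, n_iter=100):
    """Find Ns >= 0 whose stationary mean function equals `target` by bisection."""
    neutral_mean = sum(f) / len(f)
    if not neutral_mean <= target < max(f):
        raise ValueError("target must lie between the neutral mean and the maximum")
    low, high = 0.0, 1.0
    while mean_function(f, high) < target:
        high *= 2  # expand the bracket until it contains the target
    for _ in range(n_iter):
        mid = (low + high) / 2
        if mean_function(f, mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

f = [0.0, 1.0, 1.0, 3.0]
Ns = solve_Ns(f, target=2.0)
Ns_perc = solve_Ns(f, target=percentile(f, 90))
print(Ns, Ns_perc)
```

Bisection works here because the stationary mean increases monotonically with Ns, from the neutral mean at Ns = 0 towards the maximum function value, which is also why targets outside that range are rejected.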

write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')

Write the output of the visualization to files with a common prefix. The output can include up to three different tables, depending on the options provided:

Nodes coordinates: Contains the coordinates for each state, along with the associated function values and stationary frequencies. Stored in either CSV format with the suffix “nodes.csv” or Parquet format with the suffix “nodes.pq”.
Decay rates: Contains the decay rates and relaxation times associated with each component or diffusion axis. Stored in CSV format with the suffix “decay_rates.csv”.
Edges: Contains the adjacency relationships between states. This is not stored by default unless write_edges=True. Since the edges remain unchanged for any visualization on the same SequenceSpace, they only need to be stored once. Stored in either CSV format or the more efficient NPZ format for sparse matrices.

Parameters:

prefixstr: Prefix for the filenames used to store the tables.
write_edgesbool, optional, default=False: Whether to write the adjacency relationships between states (edges) to a file.
nodes_format{‘parquet’, ‘csv’}, optional, default=’parquet’: Format for storing the nodes information. Parquet is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.
edges_format{‘npz’, ‘csv’}, optional, default=’npz’: Format for storing the edges information. NPZ is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.

Plotting

Summary statistics

gpmap.plot.mpl.plot_correlation_distance(corr, axes, x='d', y='emp_cor')

Plot the correlation as a function of Hamming distance.

This function visualizes the relationship between the Hamming distance and the empirical correlation by plotting the data points and connecting them with a line. It also customizes the axes labels, limits, and ticks for better interpretability.
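
To make the plotted quantity concrete, the sketch below computes a correlation for each Hamming distance class on a toy landscape in plain Python. It is an assumption that the default x='d' and y='emp_cor' columns hold values of this kind; the function `correlation_by_distance` and the use of all ordered pairs of sequences are choices made for the example.

```python
from itertools import product

def correlation_by_distance(f):
    """Pearson correlation of phenotypes over all ordered pairs at each Hamming distance."""
    pairs = {}
    for a, b in product(f, repeat=2):
        d = sum(x != y for x, y in zip(a, b))
        pairs.setdefault(d, []).append((f[a], f[b]))
    result = {}
    for d, values in sorted(pairs.items()):
        n = len(values)
        mean_x = sum(x for x, _ in values) / n
        mean_y = sum(y for _, y in values) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in values) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in values) / n
        var_y = sum((y - mean_y) ** 2 for _, y in values) / n
        result[d] = cov / (var_x * var_y) ** 0.5
    return result

# Additive toy landscape on 4 biallelic sites.
effects = [1.0, 2.0, 0.5, 1.5]
f = {
    "".join(g): sum(e for e, allele in zip(effects, g) if allele == "1")
    for g in product("01", repeat=4)
}
corr = correlation_by_distance(f)
for d, value in corr.items():
    print(d, round(value, 3))
```

For a purely additive landscape on biallelic sites the correlation decays linearly with distance, from 1 at distance 0 to -1 at the maximum distance.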

gpmap.plot.mpl.plot_correlation_U_sites(corr, axes, x='d_jittered', y='emp_cor')

Plot the correlation values for each distance class corresponding to all possible combinations of sites at which two sequences differ.

Each point represents a distance class. Each distance class corresponds to one of the possible subsets of sites (of any size) and is plotted at a jittered Hamming distance on the x-axis.

Parameters:

corrpd.DataFrame: DataFrame containing correlation data with at least columns for x and y coordinates, and index values representing sequences.
axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the correlations.
xstr, default=”d_jittered”: Column name in corr for the x-axis coordinates (jittered Hamming distance).
ystr, default=”emp_cor”: Column name in corr for the y-axis coordinates and correlation values to plot.

gpmap.plot.mpl.plot_interaction_matrix(matrix, axes, cmap='binary', vmax=None, scale_factor=1e-06, position_labels=None, xlabel='Site 1', ylabel='Site 2', cbar_label='Interaction strength ($a_{ij}$)')

Plots a heatmap of the estimated interaction strengths using local epistasis regression.

This plots the inverse of the regularization parameters for local epistatic interactions involving every pair of sites.

Parameters:

matrixpd.DataFrame or np.ndarray: 2D array or DataFrame containing the matrix values to visualize.
axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the matrix.
cmapstr, default=”binary”: Colormap name to use for the heatmap.
vmaxfloat, optional: Maximum value for the colormap scale. If None, uses the maximum value in the matrix.
scale_factorfloat, default=1e-6: Scaling factor to apply to matrix values before plotting. Useful for displaying very large or very small numbers.
position_labelsarray-like, optional: Labels for the rows and columns. If None, uses the matrix index/columns.
xlabelstr, default=”Site 1”: Label for the x-axis.
ylabelstr, default=”Site 2”: Label for the y-axis.
cbar_labelstr, default=”Interaction strength ($a_{ij}$)”: Label for the colorbar.

gpmap.plot.mpl.plot_kth_variance_components(vc, axes, color='black', cum_color='grey', bar_ylim=(0, 50), cum_ylim=(0, 100))

Plot variance components for interaction orders.

Parameters:

vcpd.DataFrame: DataFrame containing variance components for a landscape.
axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the variance components.
colorstr, optional, default=”black”: Color for the bars representing variance percentages.
cum_colorstr, optional, default=”grey”: Color for the cumulative variance line and points.
bar_ylimtuple, optional, default=(0, 50): Y-axis limits for the variance percentage bars.
cum_ylimtuple, optional, default=(0, 100): Y-axis limits for the cumulative variance line.
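
As a sketch of what variance components by interaction order mean, the example below decomposes a landscape on biallelic sites with the Walsh-Hadamard expansion and reports the percentage of variance at each order together with the cumulative percentage. It is restricted to two alleles per site and written in plain Python; the function name and the input format are assumptions of the example, not the structure of the vc DataFrame.

```python
from itertools import product

def variance_components(f, n_sites):
    """Percent of variance by interaction order for a landscape on biallelic sites.

    `f` maps tuples of +1/-1 (one entry per site) to phenotypes.
    """
    states = list(product([1, -1], repeat=n_sites))
    by_order = [0.0] * (n_sites + 1)
    for subset in product([0, 1], repeat=n_sites):
        order = sum(subset)
        if order == 0:
            continue  # the constant term carries no variance
        coef = 0.0
        for s in states:
            sign = 1
            for used, value in zip(subset, s):
                if used:
                    sign *= value
            coef += sign * f[s]
        coef /= len(states)
        by_order[order] += coef ** 2
    total = sum(by_order)
    return [100 * v / total for v in by_order[1:]]

# Two additive effects of size 1 plus one pairwise interaction of size 1.
f = {s: s[0] + s[1] + s[0] * s[1] for s in product([1, -1], repeat=2)}
percent = variance_components(f, 2)
cumulative = [sum(percent[:k + 1]) for k in range(len(percent))]
print(percent)
print(cumulative)
```

In this toy case two thirds of the variance is additive and one third is pairwise, so the cumulative curve reaches 100% at order 2.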

gpmap.plot.mpl.plot_sites_variance_components(axes, sites, cmap='Greys', vmin=0, vmax=None, xlabel='Site', ylabel='Interaction order $k$', cbar_label='% variance explained')

Plot site-level variance components as a heatmap.

Parameters:

axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the heatmap.
sitespd.DataFrame: DataFrame of shape (n_orders, n_sites) containing variance component values for each interaction order and site.
cmapstr, default=”Greys”: Colormap name to use for the heatmap.
vminfloat, default=0: Minimum value for the colormap scale.
vmaxfloat, default=None: Maximum value for the colormap scale.
xlabelstr, default=”Site”: Label for the x-axis (sites).
ylabelstr, default=”Interaction order $k$”: Label for the y-axis (interaction orders).
cbar_labelstr, default=”% variance explained”: Label for the colorbar.

gpmap.plot.mpl.plot_site_pairs_variance_components(axes, matrix, cmap='Greys', vmin=0, vmax=60, xlabel='Site 1', ylabel='Site 2', cbar_label='% pairwise and higher-order\nvariance explained')

Plot pairwise site variance components as a heatmap.

Parameters:

axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the heatmap.
matrixpd.DataFrame: DataFrame of shape (n_sites, n_sites) containing pairwise and higher-order variance component values.
cmapstr, default=”Greys”: Colormap name to use for the heatmap.
vminfloat, default=0: Minimum value for the colormap scale.
vmaxfloat, default=60: Maximum value for the colormap scale.
xlabelstr, default=”Site 1”: Label for the x-axis.
ylabelstr, default=”Site 2”: Label for the y-axis.
cbar_labelstr, default=”% pairwise and higher-order\nvariance explained”: Label for the colorbar.

Matplotlib Backend

gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0, rasterized=False)

Plots the nodes representing the states of the discrete space on the provided coordinates.

Parameters:

axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the nodes or states.
nodes_dfpandas.DataFrame: DataFrame of shape (n_genotypes, n_components + 2) containing the coordinates in each of the n_components, along with additional columns such as “function” and “stationary_freq”. Additional columns are also allowed.
xstr, default=’1’: Column in nodes_df to use for plotting the genotypes on the x-axis.
ystr, default=’2’: Column in nodes_df to use for plotting the genotypes on the y-axis.
zstr, optional: Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced, provided the axes object supports it.
alphafloat, default=1: Transparency of markers representing the nodes.
zorderint, default=2: Order in which the nodes will be rendered relative to other elements. Typically, this should be greater than the zorder used for plotting edges.
colorstr, default=’function’: Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.
vcenterbool, default=False: Whether to center the color scale around the 0 value.
vmaxfloat, optional: Maximum value for the colormap.
vminfloat, optional: Minimum value for the colormap.
cmapstr or matplotlib.colors.Colormap, default=’viridis’: Colormap to use for coloring the nodes based on the color column.
cbarbool, default=True: Whether to display the colorbar.
cbar_labelstr, default=’Function’: Label for the colorbar associated with the nodes’ color scale.
cbar_axesmatplotlib.axes.Axes, optional: Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.
palettestr or dict, default=’Set1’: Name of a palette, or a dictionary mapping categories in the color column to specific colors, used when the column represents categorical data.
sizefloat or str, default=2.5: Size of the markers for the nodes. If a float is provided, it will be used for all nodes. If a string is provided, node sizes will be scaled based on the corresponding column in nodes_df.
max_sizefloat, default=40: Maximum size for the nodes when scaled.
min_sizefloat, default=1: Minimum size for the nodes when scaled.
lwfloat, default=0: Line width of the edges around the markers representing the nodes.
edgecolorstr, default=’black’: Color of the edges around the markers representing the nodes.
legendbool, default=True: Whether to display a legend on the plot.
legend_locint or tuple, default=0: Location of the legend if coloring is based on a categorical variable.
rasterizedbool, default=False: Whether to rasterize the scatterplot when rendering the plot in vector format.

gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None, rasterized=False)

Plots the edges representing connections between states in the discrete space under a particular embedding.

Parameters:

axesmatplotlib.axes.Axes: The matplotlib Axes object in which to plot the edges.
nodes_dfpandas.DataFrame: DataFrame of shape (n_genotypes, n_components + 2) containing the coordinates in each of the n_components, along with additional columns such as “function” and “stationary_freq”. Additional columns are also allowed.
edges_dfpandas.DataFrame: DataFrame of shape (n_edges, 2) containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indices of the pairs of states that are connected.
xstr, default=’1’: Column in nodes_df to use for plotting the genotypes on the x-axis.
ystr, default=’2’: Column in nodes_df to use for plotting the genotypes on the y-axis.
zstr, optional: Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced, provided the axes object supports it.
alphafloat, default=0.1: Transparency of the lines representing the edges.
zorderint, default=1: Order in which the edges will be rendered relative to other elements. Typically, this should be smaller than the zorder used for plotting nodes.
colorstr, default=’grey’: Column name in edges_df for the values used to color the edges, or a specific color to use for all edges.
cmapstr or matplotlib.colors.Colormap, default=’binary’: Colormap to use for coloring the edges based on the color column.
widthfloat or str, default=0.5: Width of the lines representing the edges. If a float is provided, it will be used for all edges. If a string is provided, edge widths will be scaled based on the corresponding column in edges_df.
max_widthfloat, default=1: Maximum width for the edges when scaled.
min_widthfloat, default=0.1: Minimum width for the edges when scaled.
rasterizedbool, optional, default=False: Whether to rasterize the plot for better performance with large datasets.

Returns:

line_collectionmatplotlib.collections.LineCollection or mpl_toolkits.mplot3d.art3d.Line3DCollection: The collection of lines representing the edges.

gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None, rasterized=False)

Plots the nodes representing the states of the discrete space on the provided coordinates and the edges representing the connections between states if provided.

Parameters:

axesmatplotlib.axes.Axes: Matplotlib Axes object in which to plot the edges and nodes.
nodes_dfpd.DataFrame of shape (n_genotypes, n_variables): DataFrame containing the coordinates in each of the n_components, in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed.
edges_dfpd.DataFrame of shape (n_edges, 2), optional: DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected. If not provided, only nodes will be plotted.
xstr, optional, default=’1’: Column in nodes_df to use for plotting the genotypes on the x-axis.
ystr, optional, default=’2’: Column in nodes_df to use for plotting the genotypes on the y-axis.
zstr, optional, default=None: Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced.
nodes_alphafloat, optional, default=1: Transparency of the markers representing the nodes.
nodes_zorderint, optional, default=2: Order in which the nodes will be rendered relative to other elements.
nodes_colorstr, optional, default=’function’: Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.
nodes_cmapstr, optional, default=’viridis’: Colormap to use for coloring the nodes based on the nodes_color column.
nodes_palettedict, optional, default=None: Dictionary mapping categories in the nodes_color column to specific colors, if the column represents categorical data.
nodes_vminfloat, optional, default=None: Minimum value for the colormap.
nodes_vmaxfloat, optional, default=None: Maximum value for the colormap.
nodes_vcenterbool, optional, default=False: Whether to center the color scale around the 0 value.
nodes_cbarbool, optional, default=True: Whether to display the colorbar for the nodes.
nodes_cbar_axesmatplotlib.axes.Axes, optional, default=None: Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.
nodes_cmap_labelstr, optional, default=’Function’: Label for the colorbar associated with the nodes’ color scale.
nodes_sizefloat, optional, default=2.5: Size of the markers for the nodes.
nodes_min_sizefloat, optional, default=1: Minimum size for the nodes when scaled.
nodes_max_sizefloat, optional, default=40: Maximum size for the nodes when scaled.
nodes_lwfloat, optional, default=0: Line width of the edges around the markers representing the nodes.
nodes_edgecolorstr, optional, default=’black’: Color of the edges around the markers representing the nodes.
edges_alphafloat, optional, default=0.1: Transparency of the lines representing the edges.
edges_zorderint, optional, default=1: Order in which the edges will be rendered relative to other elements.
edges_colorstr, optional, default=’grey’: Column name in edges_df for the values used to color the edges, or a specific color to use for all edges.
edges_cmapstr, optional, default=’binary’: Colormap to use for coloring the edges based on the edges_color column.
edges_palettedict, optional, default=None: Dictionary mapping categories in the edges_color column to specific colors, if the column represents categorical data.
edges_cbarbool, optional, default=False: Whether to display the colorbar for the edges.
edges_cbar_axesmatplotlib.axes.Axes, optional, default=None: Axes to plot the colorbar for the edges. If not provided, it will be automatically adjusted to the current Axes.
edges_widthfloat, optional, default=0.5: Width of the lines representing the edges.
edges_max_widthfloat, optional, default=1: Maximum width for the edges when scaled.
edges_min_widthfloat, optional, default=0.1: Minimum width for the edges when scaled.
sort_bystr, optional, default=None: Column in nodes_df to use for sorting the nodes before plotting.
sort_ascendingbool, optional, default=True: Whether to sort the nodes in ascending order based on the sort_by column.
center_spinesbool, optional, default=False: Whether to center the spines of the plot at (0, 0).
add_histbool, optional, default=False: Whether to add a histogram inset showing the distribution of node colors.
inset_cbarbool, optional, default=False: Whether to add an inset colorbar for the nodes.
inset_postuple, optional, default=(0.7, 0.7): Position of the inset colorbar or histogram as a fraction of the Axes.
prev_nodes_dfpd.DataFrame, optional, default=None: DataFrame containing the previous positions of the nodes. If provided, the current nodes will be aligned to minimize their distance from the previous positions.
rasterizedbool, optional, default=False: Whether to rasterize the plot for better performance with large datasets.
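
The alignment described for prev_nodes_df is not specified in detail above. Since eigenvectors are only defined up to sign, one plausible reading is that each axis is flipped when doing so brings the nodes closer to their previous positions. The sketch below implements only that assumed sign-flip rule in plain Python; `align_signs` is a hypothetical helper and may differ from what the function actually does.

```python
def align_signs(coords, prev_coords):
    """Flip the sign of each axis when that brings `coords` closer to `prev_coords`.

    Both arguments are lists of rows, one row per node and one column per axis.
    """
    n_axes = len(coords[0])
    aligned = [list(row) for row in coords]
    for k in range(n_axes):
        keep = sum((row[k] - prev[k]) ** 2 for row, prev in zip(coords, prev_coords))
        flip = sum((-row[k] - prev[k]) ** 2 for row, prev in zip(coords, prev_coords))
        if flip < keep:
            for row in aligned:
                row[k] = -row[k]
    return aligned

prev_coords = [[1.0, 2.0], [-1.0, -2.0]]
coords = [[-0.9, 2.1], [1.1, -1.9]]  # first axis came out with the opposite sign
print(align_signs(coords, prev_coords))
```

Keeping consecutive embeddings aligned in this way avoids spurious mirror flips when a series of visualizations is shown side by side or animated.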

gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})

Plot relaxation times for calculated components.

Parameters:

decay_dfpd.DataFrame: DataFrame with shape (n_components, 3) containing decay rates and associated mean relaxation times for each calculated component.
axesmatplotlib.axes.Axes, optional: Axes object to plot on. If not provided, a new figure will be created and saved to the path specified by fpath.
fpathstr, optional: File path to save the plot. If None, the axes argument must be provided for plotting.
log_scalebool, default=False: Whether to plot relaxation times on a logarithmic scale.
neutral_timefloat, optional: If provided, a horizontal line representing the neutral process relaxation time will be added to the plot. Useful for selecting relevant dimensions.
kwargsdict, optional: Additional keyword arguments for axes.plot and axes.scatter, such as color or marker style.

gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)

Generate a grid of visualizations for different stationary mean functions.

Parameters:

rwobject: An object containing the space and nodes information, as well as methods for calculating visualizations.
xstr, optional: Column in the nodes DataFrame to use for plotting the x-axis. Default is “1”.
ystr, optional: Column in the nodes DataFrame to use for plotting the y-axis. Default is “2”.
pminfloat, optional: Minimum proportion of the range to use for calculating mean functions. Default is 0.
pmaxfloat, optional: Maximum proportion of the range to use for calculating mean functions. Default is 0.8.
ncolint, optional: Number of columns in the grid. Default is 4.
nrowint, optional: Number of rows in the grid. Default is 3.
show_edgesbool, optional: Whether to include edges in the visualization. Default is True.
fpathstr, optional: File path to save the figure. If None, the figure will not be saved. Default is None.

gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)

Generate a grid of visualizations for alleles at specific positions.

Parameters:

nodes_dfpd.DataFrame: DataFrame containing the nodes’ information, including their coordinates and attributes.
edges_dfpd.DataFrame, optional: DataFrame containing the edges’ information, including connectivity between nodes. If None, edges will not be plotted. Default is None.
allele_colorstr, optional: Color used to highlight nodes corresponding to specific alleles. Default is “orange”.
background_colorstr, optional: Color used for the background nodes. Default is “lightgrey”.
positionsarray-like, optional: List of positions to visualize. If None, all positions will be used. Default is None.
position_labelsarray-like, optional: Labels for the positions. If None, positions will be labeled sequentially. Default is None.
colsizeint, optional: Width of each column in the grid. Default is 3.
rowsizefloat, optional: Height of each row in the grid. Default is 2.7.
xpos_labelfloat, optional: Horizontal position of the allele label within each subplot, as a fraction of the axes width. Default is 0.05.
ypos_labelfloat, optional: Vertical position of the allele label within each subplot, as a fraction of the axes height. Default is 0.92.
fmtstr, optional: Format to save the figure, e.g., “png” or “pdf”. Default is “png”.
fpathstr, optional: File path to save the figure. If None, the figure will not be saved. Default is None.

gpmap.plot.mpl.figure_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)

Generates a 2-panel figure summarizing the SeqDEFT model results.

The first panel shows how the cross-validated likelihood changes with the hyperparameter a and highlights the selected value used for model fitting. If sequence density data is provided, the second panel visualizes the relationship between observed sequence frequencies and estimated densities.

Parameters:

log_Lspd.DataFrame

A DataFrame of shape (num_a, 3) containing the columns:

a: The hyperparameter values.
logL: The log-likelihood values.
fold: The cross-validation fold identifiers.

seq_densitypd.DataFrame, optional

A DataFrame of shape (n_genotypes, >= 2) with the following columns:

frequency: Observed frequencies for each sequence.
Q: Estimated densities for each sequence.

If not provided, only a single-panel figure with the cross-validated likelihood curve will be generated.

err_barsstr, default='stderr'

Specifies the type of error bars to display:

'sd': Standard deviation across the different folds.
'stderr': Standard error of the mean.

show_foldsbool, default=False

Whether to display the out-of-sample log-likelihoods for the individual folds in the cross-validation procedure.

legend_locint, default=1

The location of the legend in the plot. Follows matplotlib’s legend location codes.

normalize_logLbool, default=True

If True, normalizes the log-likelihood values relative to the value at a = ∞.

Returns:

figmatplotlib.figure.Figure: The resulting figure object containing the generated plots.
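The difference between the two err_bars options can be illustrated with a small standalone sketch. It does not call gpmap: the `records` list and the `summarize_folds` helper are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption rather than documented library behavior.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical records mirroring the documented columns: a, logL, fold.
records = [
    {"a": 1.0, "logL": -10.0, "fold": 0},
    {"a": 1.0, "logL": -12.0, "fold": 1},
    {"a": 1.0, "logL": -14.0, "fold": 2},
    {"a": 10.0, "logL": -9.0, "fold": 0},
    {"a": 10.0, "logL": -9.5, "fold": 1},
    {"a": 10.0, "logL": -10.0, "fold": 2},
]


def summarize_folds(records, err_bars="stderr"):
    """Return {a: (mean logL, error bar)} computed across folds."""
    by_a = defaultdict(list)
    for row in records:
        by_a[row["a"]].append(row["logL"])

    summary = {}
    for a, values in sorted(by_a.items()):
        # Assumption: sample standard deviation (n - 1 denominator).
        sd = statistics.stdev(values)
        if err_bars == "sd":
            err = sd
        elif err_bars == "stderr":
            err = sd / math.sqrt(len(values))
        else:
            raise ValueError(f"unknown err_bars: {err_bars!r}")
        summary[a] = (statistics.mean(values), err)
    return summary


summary_sd = summarize_folds(records, err_bars="sd")
summary_se = summarize_folds(records, err_bars="stderr")

# The value of a with the highest mean out-of-sample log-likelihood.
best_a = max(summary_se, key=lambda a: summary_se[a][0])
```

The standard error shrinks with the number of folds, so 'stderr' bars are narrower than 'sd' bars computed from the same folds.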

Plotly Backend

gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)

Creates an interactive plot of a fitness landscape with genotypes as nodes and single point mutations as edges using Plotly.

Parameters:

nodes_dfpd.DataFrame: DataFrame containing genotype information. Must include columns for coordinates (e.g., “1”, “2”, “3”), “function”, and optionally other metadata.
edges_dfpd.DataFrame, optional: DataFrame containing edge connectivity information. Must include columns “i” and “j” for connected node indices.
xstr, default ‘1’: Column name in nodes_df for the x-axis coordinates.
ystr, default ‘2’: Column name in nodes_df for the y-axis coordinates.
zstr, optional: Column name in nodes_df for the z-axis coordinates. If provided, a 3D plot will be generated.
nodes_colorstr, default ‘function’: Column name in nodes_df for node coloring or a specific color value.
nodes_sizefloat, default 4: Size of the nodes. Can be a constant or a column name in nodes_df.
nodes_cmapstr, default ‘viridis’: Colormap for node coloring.
nodes_cmap_labelstr, default ‘Function’: Label for the colorbar associated with node coloring.
edges_widthfloat, default 0.5: Width of the edges. Can be a constant or a column name in edges_df.
edges_colorstr, default ‘#888’: Color of the edges.
edges_alphafloat, default 0.2: Transparency of the edges.
textarray-like, optional: Labels for nodes to display on hover. Defaults to nodes_df.index.
fpathstr, optional: File path to save the interactive plot as an HTML file.

Returns:

figplotly.graph_objects.Figure: The generated Plotly figure.

Datashader Backend

gpmap.plot.ds.plot_nodes(nodes_df, x='1', y='2', color='function', cmap='viridis', vmin=None, vmax=None, size=5, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=True, shade=True, resolution=800, square=False)

Plot nodes with various customization options.

Parameters:

nodes_dfpandas.DataFrame: DataFrame containing node data.
xstr, optional: Column name for the x-axis, by default “1”.
ystr, optional: Column name for the y-axis, by default “2”.
colorstr, optional: Column name for the color values, by default “function”.
cmapstr, optional: Colormap to use for coloring nodes, by default “viridis”.
vminfloat, optional: Minimum value for color scaling, by default None.
vmaxfloat, optional: Maximum value for color scaling, by default None.
sizeint, optional: Size of the nodes, by default 5.
linewidthint, optional: Line width of the node edges, by default 0.
edgecolorstr, optional: Color of the node edges, by default “black”.
sort_bystr, optional: Column name to sort nodes by, by default None.
sort_ascendingbool, optional: Whether to sort nodes in ascending order, by default True.
shadebool, optional: Whether to use datashader for rendering, by default True.
resolutionint, optional: Resolution of the plot, by default 800.
squarebool, optional: Whether to enforce a square aspect ratio, by default False.

Returns:

holoviews.Element: A Holoviews element representing the plotted nodes.

gpmap.plot.ds.plot_edges(nodes_df, edges_df, x='1', y='2', cmap='grey', width=0.5, alpha=0.2, color='grey', shade=True, resolution=800, square=True)

Plot edges.

Parameters:

nodes_dfpandas.DataFrame: DataFrame containing node data.
edges_dfpandas.DataFrame: DataFrame containing edge data.
xstr, optional: Column name for the x-axis, by default “1”.
ystr, optional: Column name for the y-axis, by default “2”.
cmapstr, optional: Colormap to use for coloring edges, by default “grey”.
widthfloat, optional: Line width of the edges, by default 0.5.
alphafloat, optional: Transparency level of the edges, by default 0.2.
colorstr, optional: Color of the edges, by default “grey”.
shadebool, optional: Whether to use datashader for rendering, by default True.
resolutionint, optional: Resolution of the plot, by default 800.
squarebool, optional: Whether to enforce a square aspect ratio, by default True.

Returns:

holoviews.Element: A Holoviews element representing the plotted edges.

gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)

Plots the nodes representing the states of the discrete space on the provided coordinates and the edges representing the connections between states if provided.

Parameters:

nodes_dfpandas.DataFrame: DataFrame containing node data.
xstr, optional: Column name for the x-axis, by default “1”.
ystr, optional: Column name for the y-axis, by default “2”.
edges_dfpandas.DataFrame, optional: DataFrame containing edge data, by default None.
nodes_colorstr, optional: Column name for the color values, by default “function”.
nodes_cmapstr, optional: Colormap to use for coloring nodes, by default “viridis”.
nodes_sizeint, optional: Size of the nodes, by default 5.
nodes_vminfloat, optional: Minimum value for color scaling, by default None.
nodes_vmaxfloat, optional: Maximum value for color scaling, by default None.
linewidthint, optional: Line width of the node edges, by default 0.
edgecolorstr, optional: Color of the node edges, by default “black”.
sort_bystr, optional: Column name to sort nodes by, by default None.
sort_ascendingbool, optional: Whether to sort nodes in ascending order, by default False.
edges_widthfloat, optional: Line width of the edges, by default 0.5.
edges_alphafloat, optional: Transparency level of the edges, by default 1.
edges_colorstr, optional: Color of the edges, by default “grey”.
edges_cmapstr, optional: Colormap to use for coloring edges, by default “grey”.
background_colorstr, optional: Background color of the plot, by default “white”.
nodes_resolutionint, optional: Resolution of the nodes plot, by default 800.
edges_resolutionint, optional: Resolution of the edges plot, by default 1200.
shade_nodesbool, optional: Whether to use datashader for rendering nodes, by default True.
shade_edgesbool, optional: Whether to use datashader for rendering edges, by default True.
squarebool, optional: Whether to enforce a square aspect ratio, by default True.

Returns:

holoviews.Element: A Holoviews element representing the plotted visualization.

gpmap.plot.ds.dsg_to_fig(dsg)

Convert a Holoviews element to a Matplotlib figure.

Parameters:

dsgholoviews.Element: A Holoviews element to be converted.

Returns:

matplotlib.figure.Figure: A Matplotlib figure object representing the Holoviews element.

gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)

Generate a grid of allele visualizations and save the resulting figure.

Parameters:

nodes_dfpandas.DataFrame: DataFrame containing node data.
fpathstr: File path to save the resulting figure.
xstr, optional: Column name for the x-axis, by default “1”.
ystr, optional: Column name for the y-axis, by default “2”.
edges_dfpandas.DataFrame, optional: DataFrame containing edge data, by default None.
positionslist or numpy.ndarray, optional: List or array of positions to visualize, by default None.
position_labelslist or numpy.ndarray, optional: Labels for the positions, by default None.
edges_cmapstr, optional: Colormap to use for coloring edges, by default “grey”.
background_colorstr, optional: Background color of the plot, by default “white”.
nodes_resolutionint, optional: Resolution of the nodes plot, by default 800.
edges_resolutionint, optional: Resolution of the edges plot, by default 1200.
sort_bystr, optional: Column name to sort nodes by, by default None.
sort_ascendingbool, optional: Whether to sort nodes in ascending order, by default False.
fmtstr, optional: Format to save the figure, by default “png”.
figsizetuple, optional: Size of the figure in inches, by default None.
squarebool, optional: Whether to enforce a square aspect ratio, by default True.

Datasets

gpmap.datasets.list_available_datasets()

Retrieve the names of all available built-in datasets.

This function scans the directory specified by LANDSCAPES_DIR and extracts the names of all files present, excluding their extensions. It returns these names as a list.

Returns:

list: A list of strings, where each string is the name of a built-in dataset.
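The directory scan described above can be sketched with the standard library alone. This is an illustration, not the library's code: the `list_dataset_names` helper and the file names are hypothetical, and a temporary directory stands in for LANDSCAPES_DIR.

```python
import tempfile
from pathlib import Path


def list_dataset_names(landscapes_dir):
    """Return the sorted file names in a directory, without extensions."""
    return sorted(p.stem for p in Path(landscapes_dir).iterdir() if p.is_file())


# A temporary directory stands in for LANDSCAPES_DIR; file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("dataset_a.csv", "dataset_b.parquet"):
        (Path(tmp) / fname).touch()
    names = list_dataset_names(tmp)
```

Note that `Path.stem` removes only the last suffix, so a file with a compound extension would keep the earlier parts of its name.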

class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)

DataSet object for managing and manipulating various components related to a specific dataset. This includes the original data, reconstructed landscape, and visualization coordinates.

Parameters:

dataset_namestr: The name of the dataset to load from the built-in list. If data or landscape are provided, this will be the name assigned to the new dataset.
datapd.DataFrame, shape (n_obs, n_features), optional: A DataFrame containing the experimental data with genotypes as the index.
landscapepd.DataFrame, shape (n_genotypes, 1), optional: A DataFrame containing the complete combinatorial landscape used to build the remaining components of the dataset.

Attributes:

data
edges
landscape
nodes
relaxation_times

Methods

`calc_visualization`([Ns, mean_function, ...])	Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk.
`plot`()	Makes a two-panel figure with the relaxation times associated with the computed Diffusion axes and a low-dimensional representation of the complete genotype-phenotype map from this `DataSet`.
`save`([fdir])	Saves the dataset to disk for direct access within the library.
`to_sequence_space`()	Generate a SequenceSpace object from the dataset's landscape.

calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=20)

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, each re-scaled by a factor that depends on its eigenvalue so that the embedding is in units of square root of time.

Parameters:

Nsfloat, optional: Scaled effective population size to use in the underlying evolutionary model. If not provided, it will be derived from mean_function or mean_function_perc.
mean_functionfloat, optional: Mean function at stationarity to derive the associated Ns. Either this or mean_function_perc must be provided if Ns is not specified.
mean_function_percfloat, optional: Percentile that the mean function at stationarity takes within the distribution of function values along sequence space. For example, if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values. Either this or mean_function must be provided if Ns is not specified.
n_componentsint, default=20: Number of eigenvectors or Diffusion axes to calculate.
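To make the mean_function_perc option concrete, the sketch below computes the function value at a given percentile of a landscape. It is a standalone illustration: the `percentile` helper and the toy values are hypothetical, and the linear-interpolation convention is an assumption about how the percentile is taken. Deriving Ns from the resulting target value is not shown.

```python
def percentile(values, perc):
    """Percentile with linear interpolation between neighboring order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * perc / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Toy function values across a five-genotype landscape.
function_values = [0.0, 1.0, 2.0, 3.0, 4.0]

# Target mean function at stationarity for a hypothetical mean_function_perc=75.
target_mean = percentile(function_values, 75)
```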

plot(): Makes a two panel figure with the relaxation times associated to the computed Diffusion axes and a low dimensional representation of the complete genotype-phenotype map from this DataSet.

save(fdir=None)

Saves the dataset to disk for direct access within the library.

This method stores the raw data, the inferred genotype-phenotype map, and, when available, the computed visualization coordinates and relaxation times.

Parameters:

fdirstr, optional: Directory where the dataset should be saved. If not provided, the default directories defined in the settings will be used.

Notes

The dataset is stored by default in the installation folder and will be deleted upon re-installation.
If a custom directory is provided, the dataset will be saved with specific suffixes for each component.

to_sequence_space()

Generate a SequenceSpace object from the dataset’s landscape.

This method constructs a SequenceSpace object using the genotypes and their corresponding values from the dataset’s landscape.

Returns:

SequenceSpace: A SequenceSpace object representing the dataset’s landscape.

Utilities

Input/Output

gpmap.utils.read_dataframe(fpath)

Read a DataFrame from a file path in different formats.

Parameters:

fpathstr: Path to the file containing the DataFrame. The file extension determines the format: ‘csv’ or ‘parquet’.

Returns:

pd.DataFrame: The DataFrame read from the file.

Raises:

ValueError: If the file format is not recognized.
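The extension-based dispatch can be sketched as follows. The `detect_format` helper is hypothetical and only shows the control flow, not the actual file reading. Treating the extension case-insensitively is a choice of this sketch, not documented behavior.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("csv", "parquet")


def detect_format(fpath):
    """Return the format implied by the file extension or raise ValueError."""
    # Assumption of this sketch: the extension is compared case-insensitively.
    suffix = Path(fpath).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"file format not recognized: {suffix!r}")
    return suffix
```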

gpmap.utils.read_edges(fpath, log=None, return_df=True)

Reads the incidence matrix containing the adjacency information among genotypes from a sequence space.

Parameters:

fpathstr: Path to the file containing the edges of a sequence space. The extension is used to differentiate between csv, parquet, and the more efficient npz format.
logLogTrack, optional: Logger instance to log messages. Default is None.
return_dfbool, default=True: Whether to return a pandas DataFrame with the edges. If False, it will return a csr_matrix.

Returns:

edges_dfpd.DataFrame or csr_matrix: If return_df is True, returns a DataFrame with columns i and j containing the indices of the genotypes that are separated by a single mutation in a sequence space. If return_df is False, returns a csr_matrix representation of the edges.
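As an illustration of what the i and j columns encode, the sketch below enumerates all pairs of genotypes separated by a single mutation in a toy two-site, two-allele space. It is independent of gpmap. Listing each pair once with i < j is an assumption of this sketch, since the stored files may list edges in both directions.

```python
from itertools import combinations


def hamming(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(seq1, seq2))


genotypes = ["AA", "AB", "BA", "BB"]

# Each edge joins the indices of two genotypes one mutation apart.
edges = [
    {"i": i, "j": j}
    for i, j in combinations(range(len(genotypes)), 2)
    if hamming(genotypes[i], genotypes[j]) == 1
]
```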

Genotype dataframes

gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)

Select the specified genotypes from nodes_df, along with the corresponding edges among the remaining genotypes if edges are provided.

Parameters:

nodes_dfpd.DataFrame: A DataFrame containing genotypes as the index and various features as columns. Typically, it includes at least the coordinates for visualization, but it may also retain other metadata.
genotypesarray-like: An array of genotypes to select from the input landscape. By default, it should contain genotype labels, or indexes if the is_idx option is set to True.
edgespd.DataFrame or scipy.sparse.csr_matrix, optional: A DataFrame or csr_matrix representing the adjacency relationships among genotypes provided in nodes_df within the discrete space. Defaults to None.
is_idxbool, optional: Indicates whether the genotypes argument is an array of indexes instead of an array of genotype labels. Defaults to False.

Returns:

pd.DataFrame or tuple: If edges is None, returns a DataFrame containing the filtered genotypes. Otherwise, returns a tuple containing the filtered DataFrame and the adjacency relationships between the selected genotypes.
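A minimal sketch of the selection logic on plain Python containers is given below. The `select_nodes_and_edges` helper is hypothetical, and re-indexing the retained edges to positions in the filtered list is an assumption about how the result stays self-consistent, not a statement about the library's output.

```python
def select_nodes_and_edges(labels, edges, keep):
    """Keep the requested genotypes and the edges joining two kept genotypes."""
    keep_set = set(keep)
    old_idx = [i for i, label in enumerate(labels) if label in keep_set]
    # Map original positions to positions in the filtered list (assumption).
    new_idx = {old: new for new, old in enumerate(old_idx)}
    kept_labels = [labels[i] for i in old_idx]
    kept_edges = [
        (new_idx[i], new_idx[j])
        for i, j in edges
        if i in new_idx and j in new_idx
    ]
    return kept_labels, kept_edges


labels = ["AA", "AB", "BA", "BB"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
kept_labels, kept_edges = select_nodes_and_edges(labels, edges, ["AA", "AB", "BB"])
```

Edges touching the dropped genotype "BA" disappear, and the edge between "AB" and "BB" is renumbered to match the filtered list.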

gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})

Filter and return the genotype labels that satisfy the specified conditions based on maximum and minimum values for the columns in the input DataFrame.

Parameters:

nodes_dfpd.DataFrame: DataFrame with genotypes as the index and various features as columns. Typically, it contains at least the coordinates for visualization, but it may also include other metadata.
max_valuesdict, optional: Dictionary where keys are column names and values are the maximum thresholds for filtering genotypes. Genotypes with values greater than these thresholds in the specified columns will be excluded.
min_valuesdict, optional: Dictionary where keys are column names and values are the minimum thresholds for filtering genotypes. Genotypes with values less than these thresholds in the specified columns will be excluded.

Returns:

pd.Index: Index containing the labels of genotypes that meet the specified filtering criteria.
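The filtering rule can be sketched without pandas as follows. The `genotypes_in_region` helper and the toy `nodes` mapping are hypothetical. Following the parameter descriptions, values equal to a threshold are kept.

```python
def genotypes_in_region(nodes, max_values=None, min_values=None):
    """Return genotype labels whose columns fall within the given bounds."""
    max_values = max_values or {}
    min_values = min_values or {}
    selected = []
    for genotype, row in nodes.items():
        below_max = all(row[col] <= bound for col, bound in max_values.items())
        above_min = all(row[col] >= bound for col, bound in min_values.items())
        if below_max and above_min:
            selected.append(genotype)
    return selected


# Toy table: genotype label -> column values.
nodes = {
    "AA": {"1": -1.0, "function": 0.2},
    "AB": {"1": 0.5, "function": 1.5},
    "BA": {"1": 1.2, "function": 1.8},
    "BB": {"1": 2.0, "function": 0.9},
}

region = genotypes_in_region(
    nodes, max_values={"1": 1.5}, min_values={"function": 1.0}
)
```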

gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)

Marginalize specific positions in the sequences and compute the average of numeric values across the remaining genetic backgrounds.

Parameters:

nodes_dfpd.DataFrame: A DataFrame with sequence names as the index and at least one numeric column to compute the average across the selected genetic backgrounds.
keep_posarray-like, optional: A list of 0-indexed positions to retain. The sequences will be averaged across all genetic backgrounds specified by the positions not included in this list. If not provided, skip_pos must be specified.
skip_posarray-like, optional: A list of 0-indexed positions to marginalize out. The sequences will be averaged across these positions. If not provided, keep_pos must be specified.
return_edgesbool, optional, default=False: If True, returns an additional DataFrame containing the edges of the reduced sequence space for visualization purposes.

Returns:

nodes_dfpd.DataFrame: A DataFrame containing the average value of every numeric column in the input DataFrame, with the subsequences at the desired positions as the index.
edges_dfpd.DataFrame, optional: A DataFrame containing the edges of the reduced sequence space. This is only returned if return_edges=True.
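The averaging over genetic backgrounds can be illustrated on a toy three-site landscape. The `marginalize` helper below is a hypothetical standalone sketch that handles a single numeric value per sequence and does not build the reduced edges.

```python
from collections import defaultdict
from statistics import mean


def marginalize(landscape, keep_pos):
    """Average values over all backgrounds at the positions not in keep_pos."""
    groups = defaultdict(list)
    for seq, value in landscape.items():
        subsequence = "".join(seq[p] for p in keep_pos)
        groups[subsequence].append(value)
    return {sub: mean(values) for sub, values in groups.items()}


landscape = {
    "AAA": 1.0, "AAB": 3.0, "ABA": 2.0, "ABB": 4.0,
    "BAA": 0.0, "BAB": 2.0, "BBA": 1.0, "BBB": 3.0,
}

# Keep positions 0 and 2 (0-indexed) and average over position 1.
averaged = marginalize(landscape, keep_pos=[0, 2])
```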

Sequence handling

gpmap.seq.get_custom_codon_table(aa_mapping)

Constructs a Biopython CodonTable for translation using a custom genetic code.

Parameters:

aa_mappingpd.DataFrame: A pandas DataFrame with columns “Codon” and “Letter” representing the genetic code mapping. Stop codons should be denoted with “*”.

Returns:

codon_tableBio.Data.CodonTable.CodonTable: A Biopython CodonTable object that can be used for translating sequences with the specified custom genetic code.
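The sketch below shows how a table with “Codon” and “Letter” columns splits into the two pieces a codon table needs: a codon-to-amino-acid mapping and a list of stop codons. It deliberately avoids Biopython. The `build_code` and `translate` helpers and the four-codon mapping are hypothetical.

```python
# Hypothetical rows standing in for the "Codon" and "Letter" columns.
aa_mapping = [
    {"Codon": "ATG", "Letter": "M"},
    {"Codon": "GCT", "Letter": "A"},
    {"Codon": "TGG", "Letter": "W"},
    {"Codon": "TAA", "Letter": "*"},
]


def build_code(rows):
    """Split the mapping into coding codons and stop codons."""
    forward = {r["Codon"]: r["Letter"] for r in rows if r["Letter"] != "*"}
    stops = [r["Codon"] for r in rows if r["Letter"] == "*"]
    return forward, stops


def translate(seq, forward, stops):
    """Translate complete codons until the first stop codon."""
    protein = []
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon in stops:
            break
        protein.append(forward[codon])
    return "".join(protein)


forward_table, stop_codons = build_code(aa_mapping)
protein = translate("ATGGCTTGGTAAGCT", forward_table, stop_codons)
```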

gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')

Generate a mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are grouped into a single allele.

Parameters:

seqsarray-like of shape (n_genotypes,) or (n_obs,): Observed sequences. If counts is None, each sequence is assumed to appear once. Otherwise, frequencies are calculated using the counts as the number of times a sequence appears in the data.
n_allelesint or array-like of shape (seq_length,): Maximum number of alleles allowed per site. If an array is provided, each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles.
countsNone or array-like of shape (n_genotypes,): Number of times each sequence in seqs appears in the data. If not provided, each sequence is assumed to appear exactly once.
keep_allele_namesbool, optional: If True, allele names are preserved. Otherwise, they are replaced by new alleles taken from the alphabet. Default is True.
last_characterstr, optional: Character to use for pooled alleles when keep_allele_names is True. Default is “X”.

Returns:

codelist of dict of length seq_length: A list of dictionaries, where each dictionary maps the original alleles to the new reduced alphabet for each site.
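One way to build such a per-site mapping is sketched below. This is not the library's implementation: the `freq_reduced_code` helper is hypothetical and covers only an integer n_alleles with original allele names kept. It assumes the pooled allele counts toward the n_alleles limit, so the n_alleles - 1 most frequent alleles keep their names.

```python
from collections import Counter


def freq_reduced_code(seqs, n_alleles, counts=None, last_character="X"):
    """Map each allele at each site to itself or to the pooled character."""
    counts = counts if counts is not None else [1] * len(seqs)
    code = []
    for pos in range(len(seqs[0])):
        freqs = Counter()
        for seq, count in zip(seqs, counts):
            freqs[seq[pos]] += count
        ranked = [allele for allele, _ in freqs.most_common()]
        if len(ranked) <= n_alleles:
            # Nothing to pool at this site.
            site_code = {allele: allele for allele in ranked}
        else:
            # Assumption: the pooled allele uses one of the n_alleles slots.
            kept = set(ranked[: n_alleles - 1])
            site_code = {
                allele: allele if allele in kept else last_character
                for allele in ranked
            }
        code.append(site_code)
    return code


seqs = ["AC", "AG", "TC", "GC", "AC"]
code = freq_reduced_code(seqs, n_alleles=2)
```

At the first site, "A" is the most frequent allele and the rarer "T" and "G" are pooled. The second site already has only two alleles and is left unchanged.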

gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)

Extracts unique sequences and their counts from a Multiple Sequence Alignment (MSA). Optionally, subsequences can be selected based on specific positions, and sequence identity re-weighting can be applied to account for sequence similarities across the full alignment.

Parameters:

Xarray-like of aligned sequences: Input sequences from which to extract unique sequences and counts.
yarray-like of weights, optional (default=None): Pre-calculated weights associated with the input sequences. If not provided, weights are calculated based on sequence identity.
positionsarray-like of int, optional (default=None): Subset of positions to extract subsequences from the MSA. If not provided, the full sequences are used.
phylo_correctionbool, optional (default=False): If True, applies sequence identity re-weighting. Observations are weighted as 1 divided by the number of similar sequences in the MSA. Similar sequences are defined based on the max_dist parameter.
max_distfloat, optional (default=0.2): Maximum sequence identity distance for considering sequences as similar during re-weighting. Only used if phylo_correction is True.

Returns:

Xnp.array of shape (n_unique_seqs,): Unique subsequences at the specified positions in the MSA.
ynp.array of shape (n_unique_seqs,): Counts or re-weighted counts for each unique subsequence in the MSA.
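The counting and re-weighting steps can be sketched on a toy alignment. Both helpers below are hypothetical and independent of gpmap. The sketch assumes that distance is the fraction of differing positions over the full alignment and that sequences at exactly max_dist count as similar.

```python
from collections import Counter


def identity_weights(msa, max_dist):
    """Weight each sequence by 1 / (number of similar sequences, itself included)."""
    weights = []
    for seq1 in msa:
        n_similar = sum(
            sum(a != b for a, b in zip(seq1, seq2)) / len(seq1) <= max_dist
            for seq2 in msa
        )
        weights.append(1 / n_similar)
    return weights


def subsequence_counts(msa, positions, weights=None):
    """Sum the weights of the sequences sharing each subsequence."""
    weights = weights if weights is not None else [1] * len(msa)
    totals = Counter()
    for seq, weight in zip(msa, weights):
        totals["".join(seq[p] for p in positions)] += weight
    return dict(totals)


msa = ["AAAAA", "AAAAC", "CCCCC", "CCCCC"]
raw = subsequence_counts(msa, positions=[0, 4])
weighted = subsequence_counts(
    msa, positions=[0, 4], weights=identity_weights(msa, max_dist=0.2)
)
```

In this toy alignment every sequence has exactly one close relative, so each receives weight 0.5 and the re-weighted counts are half the raw counts.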