API Reference
Inference
Regression
- class gpmap.inference.MinimumEpistasisInterpolator(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', P=2, a=None, cg_rtol=0.0001)
Minimum epistasis interpolation model for sequence-function relationships.
A class for performing Minimum Epistasis Interpolation (MEI) to infer complete genotype-phenotype maps from incomplete and noisy data. This model applies a prior that penalizes local epistatic coefficients of order P and infers the posterior distribution based on experimental data for a subset of sequences.
- Parameters:
- n_alleles: int, optional
The number of alleles per site. If not provided, it will be inferred from the provided data.
- seq_length: int, optional
The length of the genotype sequences. If not provided, it will be inferred from the provided data.
- genotypes: array-like, optional
A list or array of genotypes to be used in the interpolation. If not provided, the model will infer the genotype space.
- alphabet_type: str, optional
The type of alphabet used for genotypes. Default is “custom”.
- P: int, optional
The order of epistasis to consider. Default is 2. This determines the level of interaction between genetic sites that is penalized.
- a: float, optional
The regularization parameter. If not provided, it will be inferred during the fitting process to best match the observed data.
- cg_rtol: float, optional
The relative tolerance for the conjugate gradient solver. Default is 1e-4. This controls the precision of the solver used in computations.
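To make the role of P concrete, the following minimal sketch computes a local epistatic coefficient of order 2 for a single double-mutant cycle, the kind of quantity the prior is described as penalizing. The function name and the toy phenotype values are illustrative assumptions and are not part of gpmap.

```python
def local_pairwise_epistasis(f_wt, f_single_a, f_single_b, f_double):
    """Order-2 local epistatic coefficient for one double-mutant cycle.

    It measures how much the effect of mutation A changes when
    mutation B is already present (zero for a purely additive map).
    """
    return f_double - f_single_a - f_single_b + f_wt


# Toy phenotypes (hypothetical values): additive case vs. epistatic case.
additive = local_pairwise_epistasis(0.0, 1.0, 2.0, 3.0)
epistatic = local_pairwise_epistasis(0.0, 1.0, 2.0, 5.0)
print(additive, epistatic)
```

A map in which every such coefficient is zero is purely additive, which is why penalizing these coefficients favors the least epistatic interpolation consistent with the data.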
Methods
fit(X, y[, y_var]): Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
predict([X_pred, calc_variance]): Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
sample_prior(): Generate a sample from the prior distribution.
simulate([X, y_var, p_missing, seed]): Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- fit(X, y, y_var=None)
Fits the Minimum Epistasis Interpolation (MEI) model hyperparameter to the provided data.
This method infers the optimal regularization parameter a by computing the Minimum Epistasis Interpolation solution. It determines the value of a such that the expected average squared Pth epistatic coefficients match those of the MEI solution.
- Parameters:
- X: array-like of shape (n_obs,)
Array containing the genotypes for which observations are provided in y.
- y: array-like of shape (n_obs,)
Array containing the observed phenotypes corresponding to the genotypes in X.
- y_var: array-like of shape (n_obs,), optional
Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
A DataFrame summarizing the posterior distribution for each contrast. The columns include:
- estimate: Posterior mean for each contrast.
- std: Posterior standard deviation for each contrast.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
- p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
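A contrast is a linear combination of genotype values, so under a Gaussian posterior its mean and variance follow from the standard linear-Gaussian identities. The sketch below illustrates that arithmetic in plain Python; it does not call gpmap, and the function name, the toy posterior, and the use of an explicit covariance matrix are assumptions made for illustration only.

```python
import math


def contrast_posterior(contrast, post_mean, post_cov):
    """Posterior mean and std of sum_i c_i * f_i for a Gaussian posterior."""
    n = len(contrast)
    estimate = sum(contrast[i] * post_mean[i] for i in range(n))
    variance = sum(
        contrast[i] * post_cov[i][j] * contrast[j]
        for i in range(n)
        for j in range(n)
    )
    return estimate, math.sqrt(variance)


# Hypothetical posterior over two genotypes, e.g. "AA" and "AC".
post_mean = [1.0, 3.0]
post_cov = [[0.25, 0.0], [0.0, 0.75]]
# Contrast encoding the difference f("AC") - f("AA").
estimate, std = contrast_posterior([-1.0, 1.0], post_mean, post_cov)
print(estimate, std)
```

Each column of contrast_matrix plays the role of the `contrast` vector here, with one weight per genotype in the index.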
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
- Parameters:
- X_pred: array-like of shape (n_genotypes,), optional
Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
- calc_variance: bool, optional, default=False
If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.
- Returns:
- pred: pd.DataFrame of shape (n_genotypes, n_columns)
A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:
- f_var: Posterior variance for each genotype.
- f_std: Posterior standard deviation for each genotype.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
The genotype labels are used as the row index.
Notes
The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.
Examples
Predict phenotypes for the entire genotype space:
>>> pred = model.predict()
Predict phenotypes for specific genotypes with variance:
>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
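The relationship between the extra columns can be sketched without the library. The helper below is a hypothetical illustration of the rule stated in the Notes (interval = mean ± 2 * standard deviation) applied to a single genotype; the input numbers are made up.

```python
import math


def summarize_posterior(f, f_var):
    """Build the extra columns described above from a mean and a variance."""
    f_std = math.sqrt(f_var)
    return {
        "f": f,
        "f_var": f_var,
        "f_std": f_std,
        "ci_95_lower": f - 2 * f_std,
        "ci_95_upper": f + 2 * f_std,
    }


row = summarize_posterior(f=1.5, f_var=0.04)
print(row)
```

Note that the factor 2 is a rounded version of the 1.96 quantile of the normal distribution, so the interval is slightly wider than an exact 95% interval.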
- sample_prior()
Generate a sample from the prior distribution.
This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.
- Returns:
- f: numpy.ndarray
A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
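The sampling scheme described above (standard normal draws transformed by a square root of the covariance matrix) can be sketched for a tiny example. This is a minimal illustration, assuming a Cholesky factor as the matrix square root and a hand-written 3-genotype covariance; the factorization gpmap actually uses is not stated here, and all names below are hypothetical.

```python
import math
import random


def cholesky(K):
    """Lower-triangular L with L @ L.T == K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def sample_from_prior(K, rng):
    """Draw f = L z with z ~ N(0, I), so that Cov(f) = K."""
    L = cholesky(K)
    n = len(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(L[i][k] * z[k] for k in range(n)) for i in range(n)]


# Hypothetical prior covariance over three genotypes.
K = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
f = sample_from_prior(K, random.Random(0))
print(f)
```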
- simulate(X=None, y_var=0.0, p_missing=0, seed=None)
Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- Parameters:
- X: array-like, optional
Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
- y_var: float or array-like, optional
Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
- p_missing: float, optional
Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
- seed: float, optional
Random seed for reproducibility. Default is None.
- Returns:
- f: array-like
The true simulated measurements without experimental noise.
- X: array-like
The input sequences used for the simulation.
- y: array-like
The simulated measurements with experimental noise added.
- y_var: array-like
The standard deviation of the experimental noise for each input sequence.
- Raises:
- ValueError
If the shape of y_var does not match the expected dimensions.
Examples
Simulate data with default parameters:
>>> f, X, y, y_var = gp.simulate()
Simulate data with custom noise and missing probability:
>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
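The two perturbations applied by simulate (Gaussian measurement noise and random omission of genotypes) can be sketched in plain Python. This is an illustration only, not the library's code. It assumes that y_var is a noise variance, so the noise standard deviation is sqrt(y_var); the function name and return values are hypothetical.

```python
import math
import random


def simulate_measurements(f, y_var=0.0, p_missing=0.0, seed=None):
    """Add Gaussian noise to true values f and randomly drop genotypes.

    Assumption: y_var is treated as a noise variance, so the noise
    standard deviation is sqrt(y_var).
    """
    rng = random.Random(seed)
    # Keep each genotype independently with probability 1 - p_missing.
    kept = [i for i in range(len(f)) if rng.random() >= p_missing]
    sd = math.sqrt(y_var)
    y = [f[i] + rng.gauss(0.0, sd) for i in kept]
    return kept, y


f_true = [1.0, 2.0, 3.0, 4.0]
kept, y = simulate_measurements(f_true, y_var=0.1, p_missing=0.2, seed=42)
print(kept, y)
```

With the defaults (no noise, nothing missing) the sketch returns the true values unchanged, and fixing the seed makes repeated calls return the same draw.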
- class gpmap.inference.VCregression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', lambdas=None, beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=0.0001, progress=True)
Variance Component regression model for sequence-function relationships.
This model enables the inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior. The prior is parameterized by the contribution of different orders of interaction to the observed genetic variability of a continuous phenotype.
- Parameters:
- n_alleles: int, optional
The number of alleles per site. If not provided, it will be inferred from the data.
- seq_length: int, optional
The length of the genotype sequences. If not provided, it will be inferred from the data.
- genotypes: array-like, optional
A list or array of genotypes to be used in the interpolation.
- alphabet_type: str, optional
The type of alphabet used for genotypes. Default is “custom”.
- lambdas: array-like, optional
Variance components for each order of interaction. If not provided, they will be inferred during fitting.
- beta: float, optional
The regularization parameter for the kernel alignment. Default is 0.
- cross_validation: bool, optional
Whether to perform cross-validation to select the best penalization constant for regularized variance component inference. Default is False.
- nfolds: int, optional
The number of folds for cross-validation. Default is 5.
- cv_loss_function: str, optional
The loss function to use during cross-validation. Options are “frobenius_norm”, “logL”, or “r2”. Default is “frobenius_norm”.
- num_beta: int, optional
The number of beta values to evaluate during cross-validation. Default is 20.
- min_log_beta: float, optional
The minimum log10(beta) value for cross-validation. Default is -2.
- max_log_beta: float, optional
The maximum log10(beta) value for cross-validation. Default is 7.
- cg_rtol: float, optional
The relative tolerance for the conjugate gradient solver. Default is 1e-4.
- progress: bool, optional
Whether to display progress bars during fitting. Default is True.
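The three cross-validation parameters num_beta, min_log_beta and max_log_beta together describe a grid of candidate beta values. A minimal sketch of such a grid follows, assuming the candidates are evenly spaced in log10(beta) between the two bounds (the exact spacing used by the library is not stated above, and the helper name is hypothetical).

```python
def beta_grid(num_beta=20, min_log_beta=-2, max_log_beta=7):
    """Log-evenly spaced candidate values between 10**min and 10**max."""
    step = (max_log_beta - min_log_beta) / (num_beta - 1)
    return [10 ** (min_log_beta + i * step) for i in range(num_beta)]


betas = beta_grid()
print(len(betas), betas[0], betas[-1])
```

With the defaults this spans nine orders of magnitude, from 0.01 to 10,000,000.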
Methods
fit(X, y[, y_var, method]): Infers the Variance Components from the provided data.
get_variance_components([lambdas]): Return the variance components as a DataFrame from the \(\lambda\) values.
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
predict([X_pred, calc_variance]): Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
sample_prior(): Generate a sample from the prior distribution.
simulate([X, y_var, p_missing, seed]): Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- fit(X, y, y_var=None, method='L-BFGS-B')
Infers the Variance Components from the provided data.
This method infers the variance components, which represent the relative contribution of different orders of interaction to the variability in the sequence-function relationships. Variance components are determined through kernel alignment with the empirical distance-covariance function.
After fitting, the optimal variance components (lambdas) are stored in the VCregression.lambdas attribute for use in predictions.
- Parameters:
- X: array-like of shape (n_obs,)
Array containing the genotypes for which observations are provided in y.
- y: array-like of shape (n_obs,)
Array containing the observed phenotypes corresponding to the genotypes in X.
- y_var: array-like of shape (n_obs,), optional
Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
- method: str, optional
Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.
- get_variance_components(lambdas=None)
Return the variance components as a DataFrame from the \(\lambda\) values.
- Parameters:
- lambdas: array-like, optional
An array of eigenvalues representing the variance components. If not provided, the model’s current lambdas attribute will be used.
- Returns:
- pandas.DataFrame
A DataFrame containing the following columns:
- k: Index of the variance component (ranging from 0 to seq_length).
- lambdas: The input eigenvalues.
- var_perc: The percentage of variance explained by each component.
- var_perc_cum: The cumulative percentage of variance explained.
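As a rough illustration of how percentage columns of this kind can be derived from the eigenvalues, the sketch below weights each \(\lambda_k\) by the number of order-k terms in sequence space. This weighting, the exclusion of the k = 0 (constant) component, and the function name are all assumptions made for the example; the source does not state how gpmap computes var_perc.

```python
from math import comb


def variance_percentages(lambdas, n_alleles):
    """Percentage of variance per interaction order k >= 1.

    Assumption: the variance of order k is lambda_k times the number of
    order-k terms, comb(L, k) * (n_alleles - 1) ** k, and the k = 0
    (constant) component is excluded from the total.
    """
    seq_length = len(lambdas) - 1
    weighted = [
        lambdas[k] * comb(seq_length, k) * (n_alleles - 1) ** k
        for k in range(1, seq_length + 1)
    ]
    total = sum(weighted)
    var_perc = [100.0 * w / total for w in weighted]
    cumulative, running = [], 0.0
    for v in var_perc:
        running += v
        cumulative.append(running)
    return var_perc, cumulative


# Hypothetical eigenvalues for a biallelic landscape of length 2.
var_perc, var_perc_cum = variance_percentages([0.0, 3.0, 1.0], n_alleles=2)
print(var_perc, var_perc_cum)
```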
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
A DataFrame summarizing the posterior distribution for each contrast. The columns include:
- estimate: Posterior mean for each contrast.
- std: Posterior standard deviation for each contrast.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
- p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
- Parameters:
- X_pred: array-like of shape (n_genotypes,), optional
Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
- calc_variance: bool, optional, default=False
If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.
- Returns:
- pred: pd.DataFrame of shape (n_genotypes, n_columns)
A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:
- f_var: Posterior variance for each genotype.
- f_std: Posterior standard deviation for each genotype.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
The genotype labels are used as the row index.
Notes
The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.
Examples
Predict phenotypes for the entire genotype space:
>>> pred = model.predict()
Predict phenotypes for specific genotypes with variance:
>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
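When X_pred is None, predictions cover the entire sequence space, which contains n_alleles ** seq_length genotypes. The sketch below enumerates such a space with the standard library to show how quickly it grows. The helper name and the lexicographic ordering are assumptions; the ordering gpmap uses internally is not stated here.

```python
from itertools import product


def enumerate_genotypes(alphabet, seq_length):
    """All sequences of a given length, in lexicographic alphabet order."""
    return ["".join(chars) for chars in product(alphabet, repeat=seq_length)]


genotypes = enumerate_genotypes("ACGT", 3)
print(len(genotypes), genotypes[:2])
```

For a four-letter alphabet the space already holds 4 ** 8 = 65,536 genotypes at length 8, so passing an explicit X_pred can be useful when only a few predictions are needed.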
- sample_prior()
Generate a sample from the prior distribution.
This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.
- Returns:
- f: numpy.ndarray
A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
- simulate(X=None, y_var=0.0, p_missing=0, seed=None)
Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- Parameters:
- X: array-like, optional
Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
- y_var: float or array-like, optional
Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
- p_missing: float, optional
Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
- seed: float, optional
Random seed for reproducibility. Default is None.
- Returns:
- f: array-like
The true simulated measurements without experimental noise.
- X: array-like
The input sequences used for the simulation.
- y: array-like
The simulated measurements with experimental noise added.
- y_var: array-like
The standard deviation of the experimental noise for each input sequence.
- Raises:
- ValueError
If the shape of y_var does not match the expected dimensions.
Examples
Simulate data with default parameters:
>>> f, X, y, y_var = gp.simulate()
Simulate data with custom noise and missing probability:
>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
- class gpmap.inference.ConnectednessModelRegression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', mu=None, cg_rtol=0.0001, progress=True)
Connectedness model regression for sequence-function relationships.
This model enables the inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior. The prior is parameterized by parameters controlling the effect of mutations at specific sites on the predictability of other mutations.
- Parameters:
- n_alleles: int, optional
The number of alleles per site. If not provided, it will be inferred from the data.
- seq_length: int, optional
The length of the genotype sequences. If not provided, it will be inferred from the data.
- genotypes: array-like, optional
A list or array of genotypes to be used in the interpolation.
- alphabet_type: str, optional
The type of alphabet used for genotypes. Default is “custom”.
- mu: array-like, optional
Factors controlling the site-specific decay factors. If not provided, they will be inferred during fitting.
- cg_rtol: float, optional
The relative tolerance for the conjugate gradient solver. Default is 1e-4.
- progress: bool, optional
Whether to display progress bars during fitting. Default is True.
Methods
fit(X, y[, y_var, method]): Infers the site-specific decay factors from the provided data.
get_decay_factors(): Return the decay factors as a DataFrame from the \(\mu\) values.
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
predict([X_pred, calc_variance]): Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
sample_prior(): Generate a sample from the prior distribution.
simulate([X, y_var, p_missing, seed]): Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- fit(X, y, y_var=None, method='L-BFGS-B')
Infers the site-specific decay factors from the provided data.
This method infers the site-specific decay factors, which control the expected decrease in the predictability of other mutations in the presence of mutations at each site. Decay factors are inferred through kernel alignment with the empirical distance-covariance function.
After fitting, the optimal decay factors are used to build a Gaussian process prior for inference.
- Parameters:
- X: array-like of shape (n_obs,)
Array containing the genotypes for which observations are provided in y.
- y: array-like of shape (n_obs,)
Array containing the observed phenotypes corresponding to the genotypes in X.
- y_var: array-like of shape (n_obs,), optional
Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
- method: str, optional
Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.
- get_decay_factors()
Return the decay factors as a DataFrame from the \(\mu\) values.
- Returns:
- pandas.DataFrame
A DataFrame containing the following columns:
- p: 0-indexed position in the sequence.
- mu: The \(\mu\) value associated with each position.
- decay_factor: Decay factor associated with each position.
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
A DataFrame summarizing the posterior distribution for each contrast. The columns include:
- estimate: Posterior mean for each contrast.
- std: Posterior standard deviation for each contrast.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
- p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
- Parameters:
- X_pred: array-like of shape (n_genotypes,), optional
Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
- calc_variance: bool, optional, default=False
If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.
- Returns:
- pred: pd.DataFrame of shape (n_genotypes, n_columns)
A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:
- f_var: Posterior variance for each genotype.
- f_std: Posterior standard deviation for each genotype.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
The genotype labels are used as the row index.
Notes
The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.
Examples
Predict phenotypes for the entire genotype space:
>>> pred = model.predict()
Predict phenotypes for specific genotypes with variance:
>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
- sample_prior()
Generate a sample from the prior distribution.
This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.
- Returns:
- f: numpy.ndarray
A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
- simulate(X=None, y_var=0.0, p_missing=0, seed=None)
Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- Parameters:
- X: array-like, optional
Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
- y_var: float or array-like, optional
Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
- p_missing: float, optional
Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
- seed: float, optional
Random seed for reproducibility. Default is None.
- Returns:
- f: array-like
The true simulated measurements without experimental noise.
- X: array-like
The input sequences used for the simulation.
- y: array-like
The simulated measurements with experimental noise added.
- y_var: array-like
The standard deviation of the experimental noise for each input sequence.
- Raises:
- ValueError
If the shape of y_var does not match the expected dimensions.
Examples
Simulate data with default parameters:
>>> f, X, y, y_var = gp.simulate()
Simulate data with custom noise and missing probability:
>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
- class gpmap.inference.LocalEpistasisRegression(n_alleles=None, seq_length=None, genotypes=None, alphabet_type='custom', P=2, a_values=None, lambda_U_lower_than_P=None, cg_rtol=0.0001, progress=True)
Local epistasis regression model for sequence-function relationships.
A class for performing Local Epistasis Regression (LER) to infer complete genotype-phenotype maps from incomplete and noisy data. This model applies a prior that penalizes local epistatic coefficients of order P differently depending on the combinations of P sites and infers the posterior distribution based on experimental data for a subset of sequences.
- Parameters:
- n_alleles: int, optional
The number of alleles per site. If not provided, it will be inferred from the provided data.
- seq_length: int, optional
The length of the genotype sequences. If not provided, it will be inferred from the provided data.
- genotypes: array-like, optional
A list or array of genotypes to be used in the model. If not provided, the model will infer the genotype space.
- alphabet_type: str, optional
The type of alphabet used for genotypes. Default is “custom”.
- P: int, optional
The order of epistasis to consider. Default is 2. This determines the level of interaction between genetic sites that is penalized.
- a_values: array-like, optional
The regularization parameters for each interaction order. If not provided, they will be inferred during the fitting process to best match the observed data.
- lambda_U_lower_than_P: array-like, optional
The regularization parameters for interactions with order lower than P. If not provided, it will be inferred during fitting.
- cg_rtol: float, optional
The relative tolerance for the conjugate gradient solver. Default is 1e-4. This controls the precision of the solver used in computations.
- progress: bool, optional
Whether to display progress bars during fitting. Default is True.
Methods
fit(X, y[, y_var, method]): Fits the Local Epistasis Regression (LER) model hyperparameters to the provided data.
get_a_values([position_labels]): Return a DataFrame of interaction-specific regularization parameters.
get_empirical_pred_correlations_df(): Compute empirical and predicted correlations for pairs of sequences differing at all possible combinations of sites U.
get_lambda_U_values([position_labels]): Return a DataFrame of interaction-specific lambda values for interactions U with order lower than P.
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
predict([X_pred, calc_variance]): Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
sample_prior(): Generate a sample from the prior distribution.
simulate([X, y_var, p_missing, seed]): Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- fit(X, y, y_var=None, method='L-BFGS-B')
Fits the Local Epistasis Regression (LER) model hyperparameters to the provided data.
This method infers the optimal regularization parameters a via kernel alignment of the residuals of a P-1 order interaction model fit via maximum likelihood. Thus, we infer the a values and lambda_U that best match the empirical covariance.
- Parameters:
- X: array-like of shape (n_obs,)
Array containing the genotypes for which observations are provided in y.
- y: array-like of shape (n_obs,)
Array containing the observed phenotypes corresponding to the genotypes in X.
- y_var: array-like of shape (n_obs,), optional
Array containing the empirical or experimental variance for the measurements in y. If not provided, it is assumed to be uniform or unknown.
- method: str, optional
Optimization method to use during kernel alignment. Default is ‘L-BFGS-B’.
- get_a_values(position_labels=None)
Return a DataFrame of interaction-specific regularization parameters.
- Parameters:
- position_labels: array-like of shape (seq_length,), optional
Labels for sequence positions (ints or strings). If None, defaults to np.arange(self.seq_length).
- Returns:
- pandas.DataFrame
Rows correspond to interactions U. Columns include:
- ‘site{i}’ (for i=0..|U|-1): labels of positions in U
- ‘a_U’: regularization parameter for interaction U
- ‘interaction_strength’: 1.0 / a_U
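The layout of this table can be sketched with the standard library: one row per set U of P sites, with the interaction strength obtained as the reciprocal of a_U. The helper below is hypothetical and returns a list of dicts rather than a DataFrame; it assumes the a values are ordered like itertools.combinations over site indices, which is not confirmed by the source.

```python
from itertools import combinations


def a_values_table(a_values, seq_length, P, position_labels=None):
    """Rows of site labels, a_U and interaction_strength = 1 / a_U.

    Assumption: a_values is ordered like combinations(range(seq_length), P).
    """
    if position_labels is None:
        labels = list(range(seq_length))
    else:
        labels = list(position_labels)
    rows = []
    for U, a_U in zip(combinations(range(seq_length), P), a_values):
        row = {f"site{i}": labels[pos] for i, pos in enumerate(U)}
        row["a_U"] = a_U
        row["interaction_strength"] = 1.0 / a_U
        rows.append(row)
    return rows


# Hypothetical a values for the three site pairs of a length-3 sequence.
rows = a_values_table([2.0, 4.0, 0.5], seq_length=3, P=2)
for row in rows:
    print(row)
```

A small a_U means a weak penalty on epistasis among the sites in U, hence a large interaction strength.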
- get_empirical_pred_correlations_df()
Compute empirical and predicted correlations for pairs of sequences differing at all possible combinations of sites U.
- Returns:
- pandas.DataFrame
DataFrame indexed by concatenated site labels with columns:
- d: number of sites in the set
- n: number of observations for that set
- emp_cor: empirical centered autocovariance normalized by the zero-lag value
- pred_cor: predicted autocovariance (from the current aligner) normalized likewise
- d_jittered: jittered d useful for plotting
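The grouping behind this table can be illustrated by classifying every pair of observed sequences by the set of sites U at which they differ. The sketch below only reproduces that bookkeeping (the site set, its size d, and a pair count), under the assumption that n counts sequence pairs; it does not compute the correlations, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def differing_sites(seq_a, seq_b):
    """Tuple of 0-indexed positions at which two sequences differ."""
    return tuple(i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b)


def count_pairs_by_site_set(sequences):
    """Number of sequence pairs for each set U of differing sites."""
    counts = Counter()
    for seq_a, seq_b in combinations(sequences, 2):
        counts[differing_sites(seq_a, seq_b)] += 1
    return counts


counts = count_pairs_by_site_set(["AA", "AC", "CA", "CC"])
for U, n in sorted(counts.items()):
    print(U, len(U), n)
```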
- get_lambda_U_values(position_labels=None)
Return a DataFrame of interaction-specific lambda values for interactions U with order lower than P.
- Parameters:
- position_labels: array-like of shape (seq_length,), optional
Labels for sequence positions (ints or strings). If None, defaults to np.arange(self.seq_length).
- Returns:
- pandas.DataFrame
Rows correspond to interactions U (only those with order < P). Columns:
- ‘U’: comma-separated position labels in U
- ‘k’: number of sites in U
- ‘lambda_U’: regularization parameter for interaction U
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
A DataFrame summarizing the posterior distribution for each contrast. The columns include:
- estimate: Posterior mean for each contrast.
- std: Posterior standard deviation for each contrast.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
- p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
- Parameters:
- X_pred: array-like of shape (n_genotypes,), optional
Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
- calc_variance: bool, optional, default=False
If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.
- Returns:
- pred: pd.DataFrame of shape (n_genotypes, n_columns)
A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:
- f_var: Posterior variance for each genotype.
- f_std: Posterior standard deviation for each genotype.
- ci_95_lower: Lower bound of the 95% credible interval.
- ci_95_upper: Upper bound of the 95% credible interval.
The genotype labels are used as the row index.
Notes
The MAP estimate is computed using the posterior mean.
If calc_variance is enabled, the credible intervals are calculated as mean ± 2 * standard deviation.
Examples
Predict phenotypes for the entire genotype space:
>>> pred = model.predict()
Predict phenotypes for specific genotypes with variance:
>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
- sample_prior()
Generate a sample from the prior distribution.
This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.
- Returns:
- f: numpy.ndarray
A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
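The sampling scheme described above (standard normal draws transformed by a square root of the covariance matrix) can be sketched on a small dense covariance. This sketch assumes a Cholesky factor as the matrix square root and a toy 4-genotype covariance; the library may use a different factorization and never needs to build the dense matrix. The function name sample_gaussian_prior is hypothetical.

```python
import numpy as np

def sample_gaussian_prior(cov, rng):
    """Draw one sample f ~ N(0, cov) as L @ z, where cov = L @ L.T."""
    L = np.linalg.cholesky(cov)             # one valid matrix square root
    z = rng.standard_normal(cov.shape[0])   # independent standard normal draws
    return L @ z

# Toy positive-definite covariance over 4 genotypes (hypothetical values).
cov = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
f = sample_gaussian_prior(cov, np.random.default_rng(0))
```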
- simulate(X=None, y_var=0.0, p_missing=0, seed=None)
Simulates data under the specified prior allowing for the addition of experimental Gaussian noise and the random omission of genotypes in the output data.
- Parameters:
- Xarray-like, optional
Input sequences for which to generate the measurements y. If None, genotypes are randomly selected based on the missing probability p_missing. Default is None.
- y_varfloat or array-like, optional
Standard deviation of the experimental noise to be added to the variance components. If a float is provided, it is broadcast to match the shape of X. If an array is provided, its shape must match either the number of genotypes or the shape of X. Default is 0.0.
- p_missingfloat, optional
Probability (between 0 and 1) of randomly omitting genotypes in the simulated output data. Default is 0.
- seedfloat, optional
Random seed for reproducibility. Default is None.
- Returns:
- farray-like
The true simulated measurements without experimental noise.
- Xarray-like
The input sequences used for the simulation.
- yarray-like
The simulated measurements with experimental noise added.
- y_vararray-like
The standard deviation of the experimental noise for each input sequence.
- Raises:
- ValueError
If the shape of y_var does not match the expected dimensions.
Examples
Simulate data with default parameters:
>>> f, X, y, y_var = gp.simulate()
Simulate data with custom noise and missing probability:
>>> f, X, y, y_var = gp.simulate(y_var=0.1, p_missing=0.2, seed=42)
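The two steps that simulate layers on top of a prior sample (Gaussian measurement noise and random omission of genotypes) can be sketched as follows. This is a simplified stand-in, not the library's code: simulate_observations and its y_sd argument are hypothetical, the noise level is taken to be a standard deviation, and genotypes are represented by integer indices.

```python
import numpy as np

def simulate_observations(f, y_sd=0.0, p_missing=0.0, seed=None):
    """Keep each genotype with probability 1 - p_missing and add N(0, y_sd^2) noise."""
    rng = np.random.default_rng(seed)
    f = np.asarray(f, dtype=float)
    keep = rng.random(f.shape[0]) >= p_missing    # omit with probability p_missing
    idx = np.flatnonzero(keep)                    # indices of retained genotypes
    y = f[idx] + rng.normal(0.0, y_sd, size=idx.shape[0])
    return idx, y

f_true = np.array([0.1, 0.5, 0.9])
idx_all, y_clean = simulate_observations(f_true)                       # no noise, nothing missing
idx_none, y_none = simulate_observations(f_true, p_missing=1.0, seed=1)  # everything omitted
idx_obs, y_obs = simulate_observations(f_true, y_sd=0.1, p_missing=0.2, seed=42)
```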
Density estimation
- class gpmap.inference.SeqDEFT(n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, P=2, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)
Model for inference of a genotype-phenotype map from observations of sequences.
Sequence Density Estimation using Field Theory (SeqDEFT) model for inferring a complete sequence probability distribution under a Gaussian Process prior. The prior is parameterized by the variance of local epistatic coefficients of order P.
- Parameters:
- Pint
The order of local interaction coefficients penalized under the prior. For example, P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph, while P=3 penalizes local 3-way interactions across all possible cubes.
- afloat, optional, default=None
A parameter related to the inverse variance of the P-order epistatic coefficients being penalized. Larger values induce stronger penalization, approximating the Maximum-Entropy model of order P-1. If a=None, the optimal value of a is determined through cross-validation.
- num_regint, optional, default=20
The number of a values to evaluate during the cross-validation procedure.
- nfoldsint, optional, default=5
The number of folds to use in the cross-validation procedure.
- lambdas_P_invarray-like, optional, default=None
The inverse of the variance components for the first P orders of interaction. If provided, these values are used to regularize the kernel basis.
- a_resolutionfloat, optional, default=0.1
The resolution for determining the range of a values during cross-validation.
- max_a_maxfloat, optional, default=1e12
The maximum value of a to consider during cross-validation.
- fac_maxfloat, optional, default=0.1
A factor to determine the maximum value of a relative to the number of P-order faces in the Hamming graph.
- fac_minfloat, optional, default=1e-6
A factor to determine the minimum value of a relative to the number of P-order faces in the Hamming graph.
- optimization_optsdict, optional, default={}
A dictionary of options for the optimization procedure used to calculate the maximum entropy model.
- maxiterint, optional, default=10000
The maximum number of iterations for the optimization procedure.
- gtolfloat, optional, default=1e-6
The gradient tolerance for the optimization procedure.
- ftolfloat, optional, default=1e-8
The function tolerance for the optimization procedure.
Methods
fit(X[, y, baseline_phi, baseline_X, ...])Infers the SeqDEFT model hyperparameter a from the provided data.
make_contrasts(contrast_matrix)Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
predict([X_pred, calc_variance])Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
sample_prior()Generate a sample from the prior distribution.
simulate(N[, seed])Simulates data under the specified a penalization for local P-epistatic coefficients.
- fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)
Infers the SeqDEFT model hyperparameter a from the provided data.
This method determines the optimal regularization parameter a by evaluating the log-likelihood of held-out sequences under a grid search for a in cross-validation settings.
- Parameters:
- Xarray-like of shape (n_obs,)
Array containing the observed sequences.
- yarray-like of shape (n_obs,)
Array containing the weights for each observed sequence. By default, each sequence is assigned a weight of 1. These weights can be computed using phylogenetic correction.
- baseline_Xarray-like of shape (n_genotypes,), optional
Array containing the sequences associated with baseline_phi.
- baseline_phiarray-like of shape (n_genotypes,), optional
Array containing the baseline values (baseline_phi) to include in the model.
- positionsarray-like of shape (n_pos,), optional
If provided, subsequences at these positions in the input sequences will be used as input.
- phylo_correctionbool, optional, default=False
Whether to apply phylogenetic correction using the full-length sequences.
- adjust_freqsbool, optional, default=False
Whether to adjust densities by the expected allele frequencies in the full-length sequences.
- allele_freqsdict or codon_table, optional
Dictionary containing the expected allele frequencies for each allele in the set of possible sequences, or a codon table to generate expected amino acid frequencies. If None, these frequencies will be calculated from the full-length observed sequences.
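The selection strategy described for fit (grid search over a, scored by the log-likelihood of held-out sequences in cross-validation) can be illustrated on a much simpler density estimator. The sketch below deliberately swaps in additive smoothing of a categorical distribution, where a larger a pulls the estimate toward the uniform distribution; it is not the SeqDEFT model, and cv_loglik and select_a are hypothetical names.

```python
import numpy as np

def cv_loglik(obs, n_states, a, nfolds=5, seed=0):
    """Total held-out log-likelihood of a smoothed categorical estimate."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(obs)) % nfolds    # balanced random fold labels
    total = 0.0
    for k in range(nfolds):
        train, test = obs[folds != k], obs[folds == k]
        counts = np.bincount(train, minlength=n_states)
        q = (counts + a / n_states) / (counts.sum() + a)   # smoothing strength a
        total += np.log(q[test]).sum()
    return total

def select_a(obs, n_states, a_grid, nfolds=5):
    """Return the grid value with the highest cross-validated log-likelihood."""
    scores = np.array([cv_loglik(obs, n_states, a, nfolds) for a in a_grid])
    return a_grid[int(np.argmax(scores))], scores

# Toy data: 200 draws from a skewed distribution over 4 states.
obs = np.random.default_rng(1).choice(4, size=200, p=[0.7, 0.1, 0.1, 0.1])
a_grid = np.logspace(-2, 4, 7)
best_a, scores = select_a(obs, n_states=4, a_grid=a_grid)
```

The same fold assignment is reused for every value of a, so differences between scores reflect the regularization strength rather than the random split.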
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specified Gaussian Process prior.
This method calculates the posterior mean, standard deviation, 95% credible intervals, and the posterior probability for each linear combination of genotypes defined in the contrast matrix.
- Parameters:
- contrast_matrixpd.DataFrame of shape (n_genotypes, n_contrasts)
A DataFrame where each column represents a linear combination of genotypes (contrast) for which the posterior distribution is to be computed. The index should correspond to the genotypes.
- Returns:
- contrastspd.DataFrame of shape (n_contrasts, 5)
A DataFrame summarizing the posterior distribution for each contrast. The columns include:
estimate: Posterior mean for each contrast.
std: Posterior standard deviation for each contrast.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
p(|x|>0): Posterior probability that the absolute value of the contrast is greater than 0.
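For a Gaussian posterior with mean m and covariance Σ, a contrast c has posterior mean cᵀm and variance cᵀΣc. The sketch below applies that identity to a made-up four-genotype posterior and a single pairwise-epistasis contrast. It assumes a dense covariance and an interval of ± 2 standard deviations; the function name contrast_posterior is hypothetical and this is not the library's implementation.

```python
import numpy as np

def contrast_posterior(mean, cov, contrast_matrix):
    """Posterior mean, std and +/- 2 std interval for each column of contrast_matrix."""
    C = np.asarray(contrast_matrix, dtype=float)    # shape (n_genotypes, n_contrasts)
    est = C.T @ mean                                # c^T m for each contrast
    var = np.einsum("ik,ij,jk->k", C, cov, C)       # c^T Sigma c for each contrast
    std = np.sqrt(var)
    return est, std, est - 2 * std, est + 2 * std

# Made-up posterior over genotypes AA, AB, BA, BB.
mean = np.array([1.0, 0.5, 0.2, 0.1])
cov = 0.01 * np.eye(4)
# One contrast: epistasis = f(AA) - f(AB) - f(BA) + f(BB).
C = np.array([[1.0], [-1.0], [-1.0], [1.0]])
estimate, std, ci_lower, ci_upper = contrast_posterior(mean, cov, C)
```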
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype for the specified genotypes or the entire genotype space.
- Parameters:
- X_predarray-like of shape (n_genotypes,), optional
Array containing the genotypes for which the phenotype predictions are desired. If X_pred is None, predictions are computed for the entire sequence space.
- calc_variancebool, optional, default=False
If True, the posterior variances and standard deviations for each genotype are also computed and included in the output.
- Returns:
- predpd.DataFrame of shape (n_genotypes, n_columns)
A DataFrame containing the predicted phenotypes for each input genotype in the column f. If calc_variance=True, additional columns are included:
f_var: Posterior variance for each genotype.
f_std: Posterior standard deviation for each genotype.
ci_95_lower: Lower bound of the 95% credible interval.
ci_95_upper: Upper bound of the 95% credible interval.
The genotype labels are used as the row index.
If neither X_pred nor calc_variance is provided, the output DataFrame includes additional columns:
freq: Empirical frequencies of the genotypes.
Q_star: Estimated genotype probabilities.
Examples
Predict phenotypes for the entire genotype space:
>>> pred = model.predict()
Predict phenotypes for specific genotypes with variance:
>>> pred = model.predict(X_pred=["AAA", "AAC"], calc_variance=True)
- sample_prior()
Generate a sample from the prior distribution.
This method samples from the prior distribution by drawing random values from a standard normal distribution and transforming them using the square root of the covariance matrix. The resulting sample represents a realization of the prior distribution over genotypes.
- Returns:
- f: numpy.ndarray
A 1D array of shape (n_genotypes,) representing a sample from the prior distribution. Each element corresponds to a genotype’s value drawn from the prior.
- simulate(N, seed=None)
Simulates data under the specified a penalization for local P-epistatic coefficients.
- Parameters:
- Nint
Number of total sequences to sample.
- seedint, optional (default=None)
Random seed to use for simulation.
- Returns:
- phiarray-like of shape (N,)
Vector containing the true phi values from which samples were generated.
- Xarray-like of shape (N,)
Vector containing the sampled sequences from the probability distribution.
Examples
>>> model = SeqDEFT(n_alleles=4, seq_length=5, P=2, a=1.0)
>>> phi, X = model.simulate(N=100, seed=42)
Summary statistics
Experimental data
- class gpmap.summary.GPDataSummarizer(seq_length: int | None = None, alphabet: List[str] | None = None, alphabet_type: str = 'custom', genotypes: ndarray | None = None, X: ndarray | None = None, y: ndarray | None = None, y_var: ndarray | None = None)
Class for computing low-level descriptors of genotype-phenotype data (observed experimental data sampled from the full sequence space).
This class extends SequenceSpaceRelatedObject and provides convenience routines to store observed data and compute covariance and local epistatic summaries using operators defined for the full sequence space. Unlike GPmapSummarizer, which operates on a complete genotype-phenotype map, GPDataSummarizer works with a (possibly sparse) dataset of genotypes and corresponding phenotypes.
- Parameters:
- seq_lengthint, optional
Number of sites in the sequence (sequence length). Required unless provided via genotypes.
- alphabetlist of str, optional
Alphabet for each site (list of characters). Required unless provided via genotypes or inferred.
- alphabet_typestr, default “custom”
Type of alphabet (keeps compatibility with SequenceSpaceRelatedObject).
- genotypesnp.ndarray, optional
Array of genotype strings (one per observation). If provided, seq_length and alphabet may be inferred from these.
- Xnp.ndarray, optional
Design / indicator matrix mapping observed genotypes to the full sequence-space basis. Shape (n_obs, n_full_genotypes).
- ynp.ndarray, optional
Observed phenotype values corresponding to rows of X. Shape (n_obs,).
- y_varnp.ndarray, optional
Measurement variances for each observation. If None, zeros are used.
Methods
calc_covariance_U_sites([centered])Compute empirical auto-covariance function depending on the combination of subsets U at which pairs of genotypes differ.
calc_covariance_distance([centered])Compute empirical auto-covariance function depending on the Hamming distance between pairs of genotypes.
- calc_covariance_U_sites(centered: bool = False)
Compute empirical auto-covariance function depending on the combination of subsets U at which pairs of genotypes differ.
- Parameters:
- centeredbool, optional
If True, compute covariances using centered phenotypes (y - mean) and do not add back the phenotype mean square. If False (default), add the phenotype mean square to produce raw (uncentered) second-moment estimates.
- Returns:
- covnp.ndarray, shape (2 ** self.seq_length,)
Covariance (or mean product) estimates for each subset U in the same order returned by self.get_Us().
- nsnp.ndarray, shape (2 ** self.seq_length,)
Number of observed genotype pairs that differ exactly on the sites specified by each subset U.
- calc_covariance_distance(centered: bool = False)
Compute empirical auto-covariance function depending on the Hamming distance between pairs of genotypes.
- Parameters:
- centeredbool, optional
If True, return covariances computed on centered phenotypes (y - mean) and do not add back the phenotype mean square. If False (default), the phenotype mean square is added to produce raw (uncentered) second-moment estimates.
- Returns:
- covnp.ndarray, shape (seq_length + 1,)
Covariance (or mean product) estimates for each Hamming distance d = 0..seq_length. Note: cov[0] is adjusted to remove the mean measurement variance (self.y_var_mean).
- nsnp.ndarray, shape (seq_length + 1,)
Number of pairs of sequences at each distance class.
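A distance-based empirical covariance can be sketched by grouping genotype pairs by Hamming distance and averaging the products of their phenotypes. The sketch below is a brute-force illustration under its own assumptions: it counts unordered pairs including each genotype with itself, uses raw products when centered is False, and does not apply the measurement-variance adjustment to cov[0]. The function name covariance_by_distance is hypothetical.

```python
import itertools
import numpy as np

def covariance_by_distance(seqs, y, centered=False):
    """Mean phenotype product for genotype pairs at each Hamming distance."""
    y = np.asarray(y, dtype=float)
    seq_length = len(seqs[0])
    if centered:
        y = y - y.mean()
    sums = np.zeros(seq_length + 1)
    ns = np.zeros(seq_length + 1, dtype=int)
    for i, j in itertools.combinations_with_replacement(range(len(seqs)), 2):
        d = sum(a != b for a, b in zip(seqs[i], seqs[j]))   # Hamming distance
        sums[d] += y[i] * y[j]
        ns[d] += 1
    # Distance classes with no pairs are reported as NaN.
    cov = np.divide(sums, ns, out=np.full(seq_length + 1, np.nan), where=ns > 0)
    return cov, ns

cov, ns = covariance_by_distance(["AA", "AB", "BA", "BB"], [1.0, 2.0, 3.0, 4.0])
```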
Complete landscapes
- class gpmap.summary.GPmapSummarizer(n_alleles: int, seq_length: int, f=None)
Class for computing low-level descriptors of a complete genotype-phenotype map.
- Parameters:
- n_allelesint
Number of alleles per site.
- seq_lengthint
Number of sites in the sequence (sequence length).
- farray-like, optional
Phenotype values for every possible genotype, ordered lexicographically. If None, the phenotype vector can be provided later when calling instance methods.
Methods
calc_U_root_mean_squared_epistatic_coeffs([P, f])Compute root mean squared P-way epistatic coefficient for each possible combination of P mutations in the complete genotype-phenotype map.
calc_V_U_variance_components([f])Compute variance components contributed by interactions between every possible subset of sites U.
calc_V_k_variance_components([f])Compute variance components contributed by interactions of each order k.
calc_root_mean_squared_epistatic_coeff([P, f])Compute root mean squared epistatic coefficient of order P across all possible combinations of P mutations in the complete genotype-phenotype map.
calc_site_pairs_variance_perc(V_U_vcs[, min_k])Compute the percentage variance explained by genetic interactions of at least order min_k involving every possible pair of sites from previously computed V_U variance components.
calc_sites_variance_perc(V_U_vcs)Compute the percentage variance explained by genetic interactions of every possible order involving every possible site from previously computed V_U variance components.
- calc_U_root_mean_squared_epistatic_coeffs(P=2, f=None)
Compute root mean squared P-way epistatic coefficient for each possible combination of P mutations in the complete genotype-phenotype map.
- Parameters:
- Pint
The order of local epistatic coefficients to compute e.g. P=1 reflects mutational effects, P=2 epistatic coefficients, etc.
- farray-like, optional
Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.
- Returns:
- rmsecpd.DataFrame
Root mean squared epistatic coefficient of order P for each combination of sites U.
- calc_V_U_variance_components(f=None)
Compute variance components contributed by interactions between every possible subset of sites U.
Calculates the total variance in the phenotype vector f explained by genetic interactions involving all subsets of sites U. For each U this method projects f onto the corresponding subspace using VUProjectionOperator and computes its norm.
- Parameters:
- farray-like, optional
Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.
- Returns:
- V_U_vcspd.DataFrame
DataFrame with shape (seq_length, 5) and columns:
U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by order k
variance_perc: percentage of total variance explained by k
variance_perc_cum: cumulative percentage up to and including k
Notes
Percentages are scaled so that the sum of variance_perc is 100.
- calc_V_k_variance_components(f=None)
Compute variance components contributed by interactions of each order k.
Calculates the total variance in the phenotype vector f explained by genetic interactions of order k for k = 1..seq_length. For each k this method projects f onto the corresponding subspace using ProjectionOperator and computes its norm.
- Parameters:
- farray-like, optional
Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.
- Returns:
- V_k_vcspd.DataFrame
DataFrame with shape (seq_length, 4) and columns:
k: interaction order (1..seq_length)
variance: total variance explained by order k
variance_perc: percentage of total variance explained by k
variance_perc_cum: cumulative percentage up to and including k
Notes
Percentages are scaled so that the sum of variance_perc is 100.
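For the special case of two alleles per site, the variance contributed by each interaction order can be read off the Walsh-Hadamard transform of f: the squared coefficient of every non-empty subset of sites adds to the total variance, and grouping subsets by size gives the order-k components. The sketch below covers only that biallelic case with a dense transform. It is not the ProjectionOperator used by the library, and the function name variance_components_by_order is hypothetical.

```python
import numpy as np

def variance_components_by_order(f, seq_length):
    """Variance, percentage and cumulative percentage per interaction order k."""
    f = np.asarray(f, dtype=float)
    H = np.array([[1.0]])
    for _ in range(seq_length):                    # Sylvester construction of the Hadamard matrix
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    coeffs = H @ f / f.size                        # one Walsh coefficient per subset of sites
    order = np.array([bin(s).count("1") for s in range(f.size)])   # subset size
    variance = np.array([np.sum(coeffs[order == k] ** 2)
                         for k in range(1, seq_length + 1)])
    perc = 100 * variance / variance.sum()
    return variance, perc, np.cumsum(perc)

# Genotypes 00, 01, 10, 11 in lexicographic order.
f_additive = np.array([0.0, 1.0, 2.0, 3.0])       # purely additive landscape
f_epistatic = np.array([0.0, 0.0, 0.0, 1.0])      # landscape with a pairwise interaction
var_add, perc_add, cum_add = variance_components_by_order(f_additive, 2)
var_epi, perc_epi, cum_epi = variance_components_by_order(f_epistatic, 2)
```

In the additive example all variance falls in order k = 1, while the second landscape splits its variance between orders 1 and 2.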
- calc_root_mean_squared_epistatic_coeff(P=2, f=None)
Compute root mean squared epistatic coefficient of order P across all possible combinations of P mutations in the complete genotype-phenotype map.
- Parameters:
- Pint
The order of local epistatic coefficients to compute e.g. P=1 reflects mutational effects, P=2 epistatic coefficients, etc.
- farray-like, optional
Phenotype values for every genotype in lexicographic order. If None, the instance attribute self.f is used. If both are None, a ValueError is raised.
- Returns:
- rmsecfloat
Root mean squared epistatic coefficient of order P
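For P=2 and two alleles per site, a local epistatic coefficient is the double difference f(00) - f(01) - f(10) + f(11) taken over one pair of sites in a fixed background. The sketch below averages the squares of these coefficients over all site pairs and backgrounds. It is a brute-force illustration for the biallelic case only; the library may use different normalization, and the function name rms_pairwise_epistasis is hypothetical.

```python
import itertools
import numpy as np

def rms_pairwise_epistasis(f, seq_length):
    """Root mean squared pairwise epistatic coefficient over all faces."""
    f = np.asarray(f, dtype=float).reshape((2,) * seq_length)   # one axis per site
    squared = []
    for i, j in itertools.combinations(range(seq_length), 2):
        g = np.moveaxis(f, (i, j), (0, 1))          # remaining axes index the backgrounds
        eps = g[0, 0] - g[0, 1] - g[1, 0] + g[1, 1]
        squared.append(np.ravel(eps) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared))))

rms_additive = rms_pairwise_epistasis([0.0, 1.0, 2.0, 3.0], seq_length=2)
rms_epistatic = rms_pairwise_epistasis([0.0, 0.0, 0.0, 1.0], seq_length=2)
```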
- calc_site_pairs_variance_perc(V_U_vcs, min_k=2)
Compute the percentage variance explained by genetic interactions of at least order min_k involving every possible pair of sites from previously computed V_U variance components.
- Parameters:
- V_U_vcspd.DataFrame
DataFrame with shape (seq_length, 5) and columns:
U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by order k
variance_perc: percentage of total variance explained by k
variance_perc_cum: cumulative percentage up to and including k
This DataFrame is the output of calc_V_U_variance_components.
- min_kint, optional
Minimum interaction order to include. Defaults to 2. Must satisfy 1 <= min_k <= self.seq_length.
- Returns:
- vcs_percpd.DataFrame
Table with columns site1, site2, variance, and variance_perc that reports the percentage variance contributed by interactions of order >= min_k for each site pair.
- Raises:
- ValueError
If min_k is outside the range 1..self.seq_length or if V_U_vcs references unexpected sites.
Notes
Percentages are scaled so that the sum of variance_perc is 100.
- calc_sites_variance_perc(V_U_vcs)
Compute the percentage variance explained by genetic interactions of every possible order involving every possible site from previously computed V_U variance components.
- Parameters:
- V_U_vcspd.DataFrame
DataFrame with shape (seq_length, 5) and columns:
U: subset of sites
k: interaction order (1..seq_length)
variance: total variance explained by order k
variance_perc: percentage of total variance explained by k
variance_perc_cum: cumulative percentage up to and including k
This DataFrame is the output of calc_V_U_variance_components.
- Returns:
- vcs_percpd.DataFrame of shape (seq_length, seq_length)
Table where the rows index interaction order (1..seq_length) and the columns index each site position. Each entry reports the percentage of the total variance explained by components of order k that involve site p.
- Raises:
- ValueError
If V_U_vcs references sites outside self.positions.
Notes
Percentages are scaled so that the sum of variance_perc is 100.
Visualization
Discrete Spaces
- class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)
Class to define an arbitrary discrete space characterized by the connectivity between different states and optionally by a scalar value (e.g. fitness or energy) associated with each state.
- Parameters:
- adjacency_matrixscipy.sparse.csr_matrix of shape (n_states, n_states)
Sparse matrix representing the adjacency relationships between states. The (i, j) entry contains a 1 if states i and j are connected, and 0 otherwise.
- yarray-like of shape (n_states,), optional
Function value associated with each state.
- state_labelsarray-like of shape (n_states,), optional
Labels for the states in the discrete space.
- Attributes:
- n_statesint
Number of states in the discrete space.
- state_labelsarray-like of shape (n_states,)
Labels for the states in the discrete space.
- state_idxspd.Series of shape (n_states,)
A pandas Series mapping state labels to their corresponding indices. The index of the Series is state_labels, allowing quick lookup of indices for a given set of state labels.
is_regularboolAttribute characterizing whether the space is regular, this is, every
Methods
get_edges_df()Generate a DataFrame representing the edges of the adjacency graph.
get_neighbor_pairs()Retrieve pairs of indices representing connected states in the DiscreteSpace.
get_neighbors(states[, max_distance])Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
get_state_idxs(states)Get the indexes for the provided state labels.
- get_edges_df()
Generate a DataFrame representing the edges of the adjacency graph.
This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.
- Returns:
- edges_dfpd.DataFrame
A DataFrame with two columns:
- ‘i’: The source node of the edge.
- ‘j’: The target node of the edge.
- get_neighbor_pairs()
Retrieve pairs of indices representing connected states in the DiscreteSpace.
- Returns:
- tuple of np.ndarray
Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
- get_neighbors(states, max_distance=1)
Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
- Parameters:
- statesarray-like of shape (state_number,)
A list or numpy array of state labels from which to find neighbors.
- max_distanceint, optional, default=1
The maximum distance within which neighbors of the provided states will be included.
- Returns:
- neighbor_statesnp.array
An array containing the state labels of all unique neighbors within the specified distance from the input states.
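Collecting neighbors within a maximum distance amounts to expanding the set of reached states one adjacency step at a time. The sketch below does this on a small dense adjacency matrix with integer state indices. It is an illustration only: the library works with a sparse matrix and state labels, this sketch keeps the starting states in the result (which may differ from the library's behavior), and the function name neighbors_within is hypothetical.

```python
import numpy as np

def neighbors_within(adjacency, start_idxs, max_distance=1):
    """Indices of states reachable from start_idxs in at most max_distance steps."""
    A = np.asarray(adjacency) != 0
    reached = np.zeros(A.shape[0], dtype=bool)
    reached[list(start_idxs)] = True
    for _ in range(max_distance):
        # Add every state adjacent to at least one state reached so far.
        reached = reached | A[reached].any(axis=0)
    return np.flatnonzero(reached)

# Path graph with 4 states: 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
one_step = neighbors_within(A, [0], max_distance=1)
two_steps = neighbors_within(A, [0], max_distance=2)
```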
- get_state_idxs(states)
Get the indexes for the provided state labels.
- Parameters:
- statesarray-like
A list or array of state labels for which the indexes are to be retrieved.
- Returns:
- pandas.Series
A pandas Series containing the indexes corresponding to the provided state labels.
- class gpmap.space.GridSpace(length, y=None, ndim=2)
N-dimensional grid discrete space.
A discrete space formed by the Cartesian product of one-dimensional spaces of ordered n-states, represented by a line graph.
- Parameters:
- length: int or array-like
The number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If an array-like of lengths is provided, they will be used to form a grid with the specified dimensions, and the ndim argument will be ignored.
- ndim: int
The number of dimensions in the grid when a single length value is provided.
- y: array-like of shape (length ** ndim,) or None
Phenotypic values associated with each possible state.
Methods
get_edges_df()Generate a DataFrame representing the edges of the adjacency graph.
get_neighbor_pairs()Retrieve pairs of indices representing connected states in the DiscreteSpace.
get_neighbors(states[, max_distance])Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
get_state_idxs(states)Get the indexes for the provided state labels.
set_peaks(positions[, sigma])Set peaks in the grid space by assigning function values based on distances from specified positions.
- get_edges_df()
Generate a DataFrame representing the edges of the adjacency graph.
This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.
- Returns:
- edges_dfpd.DataFrame
A DataFrame with two columns:
- ‘i’: The source node of the edge.
- ‘j’: The target node of the edge.
- get_neighbor_pairs()
Retrieve pairs of indices representing connected states in the DiscreteSpace.
- Returns:
- tuple of np.ndarray
Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
- get_neighbors(states, max_distance=1)
Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
- Parameters:
- statesarray-like of shape (state_number,)
A list or numpy array of state labels from which to find neighbors.
- max_distanceint, optional, default=1
The maximum distance within which neighbors of the provided states will be included.
- Returns:
- neighbor_statesnp.array
An array containing the state labels of all unique neighbors within the specified distance from the input states.
- get_state_idxs(states)
Get the indexes for the provided state labels.
- Parameters:
- statesarray-like
A list or array of state labels for which the indexes are to be retrieved.
- Returns:
- pandas.Series
A pandas Series containing the indexes corresponding to the provided state labels.
- set_peaks(positions, sigma=1)
Set peaks in the grid space by assigning function values based on distances from specified positions.
- Parameters:
- positionsarray-like of shape (n_peaks, ndim)
Coordinates of the peaks in the grid space. Each row represents the position of a peak in the n-dimensional space.
- sigmafloat, optional, default=1
Controls the spread of the peaks. Smaller values result in sharper peaks, while larger values create broader peaks.
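One way to picture how sigma shapes the peaks is to place a Gaussian bump at each peak position and sum them over the grid. The functional form used by set_peaks is not stated in this reference, so the Gaussian kernel and Euclidean distance below are assumptions made purely for illustration, and gaussian_peaks is a hypothetical name.

```python
import itertools
import numpy as np

def gaussian_peaks(length, ndim, positions, sigma=1.0):
    """Sum of Gaussian bumps centered at each peak position on a regular grid."""
    coords = np.array(list(itertools.product(range(length), repeat=ndim)), dtype=float)
    y = np.zeros(coords.shape[0])
    for pos in np.atleast_2d(positions):
        d2 = np.sum((coords - pos) ** 2, axis=1)    # squared Euclidean distance to the peak
        y += np.exp(-d2 / (2 * sigma ** 2))         # smaller sigma gives a sharper peak
    return coords, y

coords, y_sharp = gaussian_peaks(length=5, ndim=2, positions=[[2, 2]], sigma=0.5)
_, y_broad = gaussian_peaks(length=5, ndim=2, positions=[[2, 2]], sigma=2.0)
```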
- class gpmap.space.CodonSpace(allowed_aminoacids, codon_table='Standard', add_variation=False, seed=None)
Generate a 3-nucleotide sequence space based on allowed amino acids.
This class creates a nucleotide sequence space corresponding to the provided amino acid constraints using a codon table. Optionally, random variation can be added to the nucleotide space.
- Parameters:
- allowed_aminoacidsstr or array-like
A single amino acid (as a string) or a list/array of allowed amino acids.
- codon_tablestr, optional
The codon table to use for mapping amino acids to nucleotides. Default is “Standard”.
- add_variationbool, optional
If True, adds random variation to the nucleotide space. Default is False.
- seedint, optional
Seed for the random number generator, used when add_variation is True. Default is None.
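The mapping from allowed amino acids to codons can be sketched with the standard genetic code written out explicitly. The sketch assigns 1 to codons encoding an allowed amino acid and 0 otherwise; those values are an assumption for illustration and may not match the function values CodonSpace assigns. The names CODON_TABLE and codon_phenotypes are hypothetical.

```python
import itertools

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}

def codon_phenotypes(allowed_aminoacids):
    """Map each of the 64 codons to 1.0 if it encodes an allowed amino acid, else 0.0."""
    allowed = set(allowed_aminoacids)
    return {codon: float(aa in allowed) for codon, aa in CODON_TABLE.items()}

y_serine = codon_phenotypes("S")
```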
Methods
get_edges_df()Generate a DataFrame representing the edges of the adjacency graph.
get_neighbor_pairs()Retrieve pairs of indices representing connected states in the DiscreteSpace.
get_neighbors(states[, max_distance])Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
get_state_idxs(states)Get the indexes for the provided state labels.
- get_edges_df()
Generate a DataFrame representing the edges of the adjacency graph.
This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.
- Returns:
- edges_dfpd.DataFrame
A DataFrame with two columns:
- ‘i’: The source node of the edge.
- ‘j’: The target node of the edge.
- get_neighbor_pairs()
Retrieve pairs of indices representing connected states in the DiscreteSpace.
- Returns:
- tuple of np.ndarray
Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
- get_neighbors(states, max_distance=1)
Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
- Parameters:
- statesarray-like of shape (state_number,)
A list or numpy array of state labels from which to find neighbors.
- max_distanceint, optional, default=1
The maximum distance within which neighbors of the provided states will be included.
- Returns:
- neighbor_statesnp.array
An array containing the state labels of all unique neighbors within the specified distance from the input states.
- get_state_idxs(states)
Get the indexes for the provided state labels.
- Parameters:
- statesarray-like
A list or array of state labels for which the indexes are to be retrieved.
- Returns:
- pandas.Series
A pandas Series containing the indexes corresponding to the provided state labels.
- class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)
Space of all possible sequences of certain length.
Class for creating a Sequence space characterized by having sequences as states. States are connected in the discrete space if they differ by a single position in the sequence. It can be created in two different ways:
From a set of sequences and function values (X, y).
By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site, and type of alphabet).
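The connectivity rule stated above (two sequences are connected when they differ at exactly one position) can be sketched by brute force for a small space. This dense construction is only an illustration and scales quadratically with the number of genotypes; the library stores a sparse matrix, and hamming_adjacency is a hypothetical name.

```python
import itertools
import numpy as np

def hamming_adjacency(alphabet, seq_length):
    """Enumerate all sequences and connect those at Hamming distance 1."""
    genotypes = ["".join(s) for s in itertools.product(alphabet, repeat=seq_length)]
    n = len(genotypes)
    A = np.zeros((n, n), dtype=int)
    for i, j in itertools.combinations(range(n), 2):
        if sum(a != b for a, b in zip(genotypes[i], genotypes[j])) == 1:
            A[i, j] = A[j, i] = 1
    return genotypes, A

genotypes, A = hamming_adjacency("ACGT", seq_length=2)
```

With 4 alleles and 2 sites, every genotype has 2 * (4 - 1) = 6 neighbors, so each row of A sums to 6.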
- Parameters:
- Xarray-like of shape (n_genotypes,), optional
Sequences to use as state labels of the discrete sequence space.
- yarray-like of shape (n_genotypes,), optional
Quantitative phenotype or fitness associated with each genotype.
- seq_lengthint, optional
Length of the sequences in the sequence space. If not provided, it will be inferred from alphabet or n_alleles.
- n_alleleslist of int, optional
List containing the number of alleles present at each site in the sequence space. This can only be specified for alphabet_type=’custom’.
- alphabet_typestr, default=’dna’
Type of sequence. Options are {‘dna’, ‘rna’, ‘protein’, ‘custom’}.
- alphabetlist of lists, optional
A list where each element is itself a list containing the different alleles allowed at each site. The number and type of alleles can vary across sites.
- stop_yfloat, optional
Value of the function assigned to protein sequences with an in-frame stop codon. If provided, the protein alphabet will be extended to include * for stop codons.
- Attributes:
- n_genotypesint
Number of states in the complete sequence space.
- genotypesarray-like of shape (n_genotypes,)
Genotype labels in the sequence space.
- adjacency_matrixscipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)
Sparse matrix representing the adjacency relationships between genotypes. The (i, j) entry contains a 1 if genotypes i and j differ by a single mutation, and 0 otherwise.
- yarray-like of shape (n_genotypes,), optional
Quantitative phenotype or fitness associated with each genotype.
is_regularboolAttribute characterizing whether the space is regular, this is, every
Methods
get_edges_df()Generate a DataFrame representing the edges of the adjacency graph.
get_neighbor_pairs()Retrieve pairs of indices representing connected states in the DiscreteSpace.
get_neighbors(states[, max_distance])Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
get_single_mutant_matrix(sequence[, center])Calculate the effects of single point mutations from a focal sequence.
get_state_idxs(states)Get the indexes for the provided state labels.
remove_codon_incompatible_transitions([codon_table])Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.
to_nucleotide_space([codon_table, alphabet_type])Convert a protein sequence space into a nucleotide sequence space.
- get_edges_df()
Generate a DataFrame representing the edges of the adjacency graph.
This method retrieves pairs of neighboring nodes from the adjacency matrix and constructs a DataFrame where each row represents an edge between two nodes.
- Returns:
- edges_dfpd.DataFrame
A DataFrame with two columns: ‘i’, the source node of the edge, and ‘j’, the target node of the edge.
- get_neighbor_pairs()
Retrieve pairs of indices representing connected states in the DiscreteSpace.
- Returns:
- tuple of np.ndarray
Two arrays of indices, where the first array contains the source indices and the second array contains the target indices of the connections.
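As a minimal sketch of what the source and target index arrays represent, the following pure-Python example enumerates neighbor pairs for a tiny complete sequence space. The helper names `hamming` and `neighbor_pairs` are hypothetical and not part of gpmap, and listing every edge in both directions is an assumption about the output.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def neighbor_pairs(genotypes):
    """Return (sources, targets) index lists for genotypes one mutation apart."""
    sources, targets = [], []
    for i, gi in enumerate(genotypes):
        for j, gj in enumerate(genotypes):
            if hamming(gi, gj) == 1:
                sources.append(i)
                targets.append(j)
    return sources, targets

# Complete space for 2 sites with alleles A and B: AA, AB, BA, BB
genotypes = ["".join(g) for g in product("AB", repeat=2)]
sources, targets = neighbor_pairs(genotypes)
edges = list(zip(sources, targets))
```

Zipping the two lists gives one row per edge, which is the information a two-column ‘i’/‘j’ edge table would carry.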
- get_neighbors(states, max_distance=1)
Retrieve the unique state labels corresponding to the neighbors of the provided states within a specified maximum distance.
- Parameters:
- statesarray-like of shape (state_number,)
A list or numpy array of state labels from which to find neighbors.
- max_distanceint, optional, default=1
The maximum distance within which neighbors of the provided states will be included.
- Returns:
- neighbor_statesnp.array
An array containing the state labels of all unique neighbors within the specified distance from the input states.
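The effect of max_distance can be illustrated with a small stand-alone sketch. It assumes that distance means Hamming distance and that the query states themselves (distance 0) are excluded; neither detail is confirmed by this reference, and `neighbors_within` is a hypothetical helper.

```python
from itertools import product

def neighbors_within(states, genotypes, max_distance=1):
    """Collect unique genotypes within max_distance mutations of any query state."""
    found = set()
    for state in states:
        for genotype in genotypes:
            distance = sum(x != y for x, y in zip(state, genotype))
            # Distance 0 is skipped, so a query state is not its own neighbor
            if 0 < distance <= max_distance:
                found.add(genotype)
    return sorted(found)

# Complete DNA space of length 2 (16 genotypes)
genotypes = ["".join(g) for g in product("ACGT", repeat=2)]
one_step = neighbors_within(["AA"], genotypes, max_distance=1)
two_steps = neighbors_within(["AA"], genotypes, max_distance=2)
```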
- get_single_mutant_matrix(sequence, center=False)
Calculate the effects of single point mutations from a focal sequence.
- Parameters:
- sequencestr
The sequence from which to compute all single point mutant effects.
- centerbool, optional, default=False
If True, the results will be centered by position, ensuring that the mean of allelic effects at each position is 0. If False, the focal sequence will have a value of 0, and the results will represent mutational effects relative to it.
- Returns:
- outputpd.DataFrame of shape (seq_length, total_alleles)
A DataFrame containing the mutational or allelic effects for each allele across all sequence positions.
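The difference between the two settings of center can be sketched without the library. The example below assumes the phenotype is available as a plain dictionary and returns a list of per-position dictionaries instead of a DataFrame; `single_mutant_effects` is a hypothetical helper, not the gpmap implementation.

```python
def single_mutant_effects(sequence, alphabet, f, center=False):
    """Return one {allele: effect} dict per position of the focal sequence."""
    rows = []
    for pos in range(len(sequence)):
        row = {}
        for allele in alphabet:
            mutant = sequence[:pos] + allele + sequence[pos + 1:]
            # Effect relative to the focal sequence, which therefore scores 0
            row[allele] = f[mutant] - f[sequence]
        if center:
            # Shift so that the allelic effects at this position average to 0
            mean = sum(row.values()) / len(row)
            row = {allele: value - mean for allele, value in row.items()}
        rows.append(row)
    return rows

# Toy additive landscape over two sites with alleles A and B
f = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 3.0}
relative = single_mutant_effects("AA", "AB", f)
centered = single_mutant_effects("AA", "AB", f, center=True)
```

With center=False the focal alleles have an effect of 0 at every position; with center=True each row sums to 0 instead.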
- get_state_idxs(states)
Get the indexes for the provided state labels.
- Parameters:
- statesarray-like
A list or array of state labels for which the indexes are to be retrieved.
- Returns:
- pandas.Series
A pandas Series containing the indexes corresponding to the provided state labels.
- remove_codon_incompatible_transitions(codon_table='Standard')
Recalculate the adjacency matrix to allow only codon-compatible transitions in a protein sequence space.
This method updates the adjacency matrix of the sequence space to ensure that transitions between states are compatible with the specified codon table. Only transitions that result in valid amino acid substitutions according to the codon table will be allowed.
- Parameters:
- codon_tablestr or Bio.Data.CodonTable
The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.
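One plausible reading of "codon-compatible" is that two amino acids stay adjacent only if some codon of one can be converted into some codon of the other by a single nucleotide substitution. The sketch below enumerates such pairs for the standard genetic code under that assumption; `compatible_pairs` is a hypothetical helper and does not reproduce the library's implementation.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def compatible_pairs(codon_table):
    """Amino acid pairs connected by at least one single-nucleotide substitution."""
    pairs = set()
    for codon, aa in codon_table.items():
        for pos, base in product(range(3), BASES):
            mutant = codon[:pos] + base + codon[pos + 1:]
            other = codon_table[mutant]
            if other != aa:
                pairs.add(frozenset((aa, other)))
    return pairs

pairs = compatible_pairs(CODON_TABLE)
```

Under this reading, methionine (ATG) and isoleucine (ATA, ATC, ATT) remain neighbors, whereas methionine and tryptophan (TGG) do not, because their codons differ at two positions.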
- to_nucleotide_space(codon_table='Standard', alphabet_type='dna')
Convert a protein sequence space into a nucleotide sequence space.
This method transforms a protein sequence space into a nucleotide sequence space using a specified codon table for translation. The resulting nucleotide space will have 4 alleles per site and 3 times the number of sites as the original protein space. It assumes that the function associated with each nucleotide sequence depends only on the protein sequence it encodes.
- Parameters:
- codon_tablestr or Bio.Data.CodonTable
The NCBI code for an existing genetic code or a custom CodonTable object used to translate nucleotide sequences into proteins.
- alphabet_typestr, optional, default=’dna’
The type of nucleotide sequence to use in the resulting space. Must be one of {‘dna’, ‘rna’}.
- Returns:
- SequenceSpace
A nucleotide sequence space with the specified properties.
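The assumption that a nucleotide sequence inherits the function of the protein it encodes can be sketched as follows. This is an illustration with plain dictionaries, not the library's code: `to_nucleotide_function` is a hypothetical helper, and assigning a fixed value to sequences with an in-frame stop codon mirrors the stop_y parameter only by analogy.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def to_nucleotide_function(protein_f, seq_length, stop_y=None):
    """Assign to each DNA sequence the phenotype of the protein it encodes."""
    nucleotide_f = {}
    for nucleotides in product("ACGT", repeat=3 * seq_length):
        dna = "".join(nucleotides)
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
        # Sequences with an in-frame stop codon receive stop_y (assumed handling)
        nucleotide_f[dna] = stop_y if "*" in protein else protein_f[protein]
    return nucleotide_f

# Protein space of length 1 with an arbitrary phenotype per amino acid
protein_f = {aa: float(i) for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
nucleotide_f = to_nucleotide_function(protein_f, seq_length=1, stop_y=-1.0)
```

A protein space with one site becomes a nucleotide space with three sites and 4 alleles per site, and synonymous codons share the same value.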
Random walks
- class gpmap.randwalk.WMWalk(space, log=None, Ns=None)
Class for Weak Mutation Random Walk on a SequenceSpace. This is a time-reversible continuous-time Markov Chain where the transition rates are determined by the differences in fitness between two states, scaled by the effective population size Ns.
The transition rate matrix Q(i, j) is defined as:
\[\begin{split}Q(i, j) = \begin{cases} M(i, j)\frac{S(i, j)}{1 - e^{-S(i, j)}} & \text{if $i$ and $j$ are neighbors}\\ -\sum_{k\neq i} Q(i, k) & \text{if } i=j \\ 0 & \text{otherwise}, \end{cases}\end{split}\]
where \(M(i, j)\) is the time-reversible neutral mutation rate between \(i\) and \(j\) and \(S(i, j)\) is the scaled fitness difference between \(i\) and \(j\), typically defined as \(S(i, j) = Ns(f_j - f_i)\), where \(f_i\) is the phenotype for state \(i\).
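A minimal numerical sketch of this rate matrix is given below. It uses the standard weak-mutation factor S / (1 - e^{-S}), which is non-negative for any S, and assumes uniform neutral mutation rates, M(i, j) = 1 between neighbors. The function names are hypothetical and the dense list-of-lists representation is for illustration only.

```python
import math

def fixation_factor(S):
    """Return S / (1 - exp(-S)), using its limit of 1 as S approaches 0."""
    if abs(S) < 1e-12:
        return 1.0
    return S / -math.expm1(-S)

def rate_matrix(genotypes, f, Ns):
    """Build Q assuming M(i, j) = 1 for all pairs of neighbors."""
    n = len(genotypes)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = sum(x != y for x, y in zip(genotypes[i], genotypes[j]))
            if distance == 1:
                Q[i][j] = fixation_factor(Ns * (f[j] - f[i]))
        # The diagonal entry makes the row sum to zero
        Q[i][i] = -sum(Q[i])
    return Q

genotypes = ["AA", "AB", "BA", "BB"]
f = [0.0, 1.0, 0.5, 2.0]
Q = rate_matrix(genotypes, f, Ns=2.0)
# Unnormalized stationary weights under uniform neutral frequencies
pi = [math.exp(2.0 * fi) for fi in f]
```

With these definitions the chain satisfies detailed balance, pi[i] * Q[i][j] == pi[j] * Q[j][i] up to rounding, which is what time reversibility requires.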
Methods
calc_rate_matrix([Ns, neutral_stat_freqs, ...])Computes and stores the rate matrix for the random walk in the discrete space.
calc_stationary_frequencies([Ns, ...])Calculates the stationary frequencies of states under the given evolutionary model.
calc_visualization([Ns, mean_function, ...])Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk.
set_Ns([Ns, mean_function, ...])Sets the scaled effective population size (Ns) or calculates it based on the desired mean function or percentile of the function values.
write_tables(prefix[, write_edges, ...])Write the output of the visualization to files with a common prefix.
- calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)
Computes and stores the rate matrix for the random walk in the discrete space.
- Parameters:
- Nsfloat, optional
Scaled effective population size for the evolutionary model. If not provided, the value of self.Ns will be used.
- neutral_stat_freqsarray-like of shape (n_states,), optional
Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.
- neutral_exchange_ratesscipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional
Sparse matrix containing the neutral exchange rates for the entire sequence space. If not provided, uniform mutational dynamics are assumed.
Notes
The resulting rate matrix is stored in the rate_matrix attribute.
The method also calculates the symmetrized rate matrix as an intermediate step.
- calc_stationary_frequencies(Ns=None, neutral_stat_freqs=None)
Calculates the stationary frequencies of states under the given evolutionary model.
- Parameters:
- Nsfloat, optional
Scaled effective population size for the evolutionary model. If not provided, the value of self.Ns will be used.
- neutral_stat_freqsarray-like of shape (n_states,), optional
Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.
- Returns:
- stationary_freqsarray-like of shape (n_states,)
The stationary frequencies of states under the given evolutionary model.
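For a time-reversible weak-mutation model of this kind, the stationary frequency of state i is proportional to its neutral stationary frequency times exp(Ns f_i). The sketch below computes it directly from that expression; whether the library evaluates it the same way is an assumption, and `stationary_frequencies` is a hypothetical stand-alone function.

```python
import math

def stationary_frequencies(f, Ns, neutral_stat_freqs=None):
    """Frequencies proportional to neutral frequency times exp(Ns * f)."""
    n = len(f)
    neutral = neutral_stat_freqs or [1.0 / n] * n
    # Subtract the largest exponent for numerical stability
    shift = max(Ns * fi for fi in f)
    weights = [p * math.exp(Ns * fi - shift) for p, fi in zip(neutral, f)]
    total = sum(weights)
    return [w / total for w in weights]

f = [0.0, 1.0, 0.5, 2.0]
neutral = stationary_frequencies(f, Ns=0.0)
selected = stationary_frequencies(f, Ns=2.0)
# Mean function at stationarity, the quantity that mean_function refers to
mean_function = sum(p * fi for p, fi in zip(selected, f))
```

At Ns = 0 the frequencies reduce to the neutral ones; increasing Ns concentrates probability on states with high function values and raises the mean function.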
- calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)
Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, each re-scaled by the square root of its relaxation time so that the embedding is in units of square root of time.
- Parameters:
- Nsfloat, optional
Scaled effective population size to use in the underlying evolutionary model. If not provided, it will be derived from mean_function or mean_function_perc.
- mean_functionfloat, optional
Mean function at stationarity to derive the associated Ns. Either this or mean_function_perc must be provided if Ns is not specified.
- mean_function_percfloat, optional
Percentile that the mean function at stationarity takes within the distribution of function values along sequence space. For example, if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values. Either this or mean_function must be provided if Ns is not specified.
- n_componentsint, default=10
Number of eigenvectors (diffusion axes) to calculate.
- neutral_stat_freqsarray-like of shape (n_states,), optional
Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, uniform stationary frequencies are assumed.
- neutral_exchange_ratesscipy.sparse.csr.csr_matrix of shape (n_states, n_states), optional
Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
- tolfloat, default=1e-12
Tolerance for the eigendecomposition solver. Lower values result in higher precision but may increase computation time.
Notes
The visualization coordinates are stored in self.nodes_df, which includes the scaled eigenvectors, function values, and stationary frequencies for each state.
Relaxation times and decay rates are stored in self.decay_rates_df.
- set_Ns(Ns=None, mean_function=None, mean_function_perc=None, neutral_stat_freqs=None, tol=0.0001)
Sets the scaled effective population size (Ns) or calculates it based on the desired mean function or percentile of the function values.
- Parameters:
- Nsfloat, optional
Scaled effective population size for the evolutionary model. If provided, it will be directly set. Must be non-negative.
- mean_functionfloat, optional
Desired mean function value at stationarity. If provided, Ns will be optimized to achieve this value.
- mean_function_percfloat, optional
Percentile of the function values to use as the desired mean function at stationarity. For example, if set to 98, the mean function will be set to the 98th percentile of the function values.
- neutral_stat_freqsarray-like of shape (n_states,), optional
Genotype stationary frequencies at neutrality to define the time-reversible neutral dynamics. If not provided, the existing neutral_stat_freqs attribute will be used if available.
- tolfloat, optional, default=1e-4
Tolerance for determining whether the mean function is close to the neutral mean function.
- Raises:
- ValueError
If none of Ns, mean_function, or mean_function_perc is provided.
- ValueError
If mean_function_perc is not between 0 and 100.
- ValueError
If mean_function is not between the neutral mean function and the maximum function value.
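Because the mean function at stationarity increases monotonically with Ns, a target mean can be matched by a one-dimensional search. The sketch below uses bisection with uniform neutral frequencies; the optimizer actually used by set_Ns is not stated in this reference, so `solve_Ns` and `mean_function_at` are hypothetical illustrations of the idea, including the way tol is applied.

```python
import math

def mean_function_at(f, Ns):
    """Mean function at stationarity for uniform neutral frequencies."""
    shift = max(Ns * fi for fi in f)
    weights = [math.exp(Ns * fi - shift) for fi in f]
    return sum(w * fi for w, fi in zip(weights, f)) / sum(weights)

def solve_Ns(f, mean_function, tol=1e-4, max_Ns=1e6):
    """Find Ns whose stationary mean function equals the target value."""
    neutral_mean = sum(f) / len(f)
    if not neutral_mean <= mean_function < max(f):
        raise ValueError("mean_function must lie between the neutral mean and the maximum")
    # A target within tol of the neutral mean corresponds to no selection
    if mean_function - neutral_mean < tol:
        return 0.0
    lo, hi = 0.0, 1.0
    while mean_function_at(f, hi) < mean_function:
        hi *= 2.0
        if hi > max_Ns:
            raise ValueError("target mean function is too close to the maximum")
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_function_at(f, mid) < mean_function:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

f = [0.0, 1.0, 0.5, 2.0]
Ns = solve_Ns(f, mean_function=1.5)
```

A percentile-based target, as in mean_function_perc, would first be converted into a function value and then handled in the same way.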
- write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')
Write the output of the visualization to files with a common prefix. The output can include up to three different tables, depending on the options provided:
Nodes coordinates: Contains the coordinates for each state, along with the associated function values and stationary frequencies. Stored in either CSV format with the suffix “nodes.csv” or Parquet format with the suffix “nodes.pq”.
Decay rates: Contains the decay rates and relaxation times associated with each component or diffusion axis. Stored in CSV format with the suffix “decay_rates.csv”.
Edges: Contains the adjacency relationships between states. This is not stored by default unless write_edges=True. Since the edges remain unchanged for any visualization on the same SequenceSpace, they only need to be stored once. Stored in either CSV format or the more efficient NPZ format for sparse matrices.
- Parameters:
- prefixstr
Prefix for the filenames used to store the tables.
- write_edgesbool, optional, default=False
Whether to write the adjacency relationships between states (edges) to a file.
- nodes_format{‘parquet’, ‘csv’}, optional, default=’parquet’
Format for storing the nodes information. Parquet is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.
- edges_format{‘npz’, ‘csv’}, optional, default=’npz’
Format for storing the edges information. NPZ is more efficient, but CSV can be used for smaller datasets or when plain text storage is preferred.
Plotting
Summary statistics
- gpmap.plot.mpl.plot_correlation_distance(corr, axes, x='d', y='emp_cor')
Plot the correlation as a function of Hamming distance.
This function visualizes the relationship between the Hamming distance and the empirical correlation by plotting the data points and connecting them with a line. It also customizes the axes labels, limits, and ticks for better interpretability.
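To make the plotted quantity concrete, the sketch below computes one possible empirical correlation per Hamming distance: the average product of centered function values over all ordered pairs of sequences at that distance, divided by the overall variance. The exact estimator behind the ‘emp_cor’ column is not specified in this reference, so this definition and the name `correlation_by_distance` are assumptions.

```python
from itertools import product

def correlation_by_distance(f):
    """Map each Hamming distance d to the autocorrelation of f at that distance."""
    genotypes = list(f)
    mean = sum(f.values()) / len(f)
    variance = sum((value - mean) ** 2 for value in f.values()) / len(f)
    sums, counts = {}, {}
    for a, b in product(genotypes, repeat=2):
        d = sum(x != y for x, y in zip(a, b))
        sums[d] = sums.get(d, 0.0) + (f[a] - mean) * (f[b] - mean)
        counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] / variance for d in sorted(sums)}

# Toy additive landscape over two sites with alleles A and B
f = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 3.0}
corr = correlation_by_distance(f)
```

The resulting mapping from distance to correlation has the same shape as the data this plot expects: one x value per distance class and one correlation value on the y-axis.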
- gpmap.plot.mpl.plot_correlation_U_sites(corr, axes, x='d_jittered', y='emp_cor')
Plot the correlation values for each distance class corresponding to all possible combinations of sites at which two sequences differ.
Each point represents a distance class, that is, one of the possible subsets of sites (of any size) at which two sequences can differ. Points are placed along the x-axis according to a jittered Hamming distance.
- Parameters:
- corrpd.DataFrame
DataFrame containing correlation data with at least columns for x and y coordinates, and index values representing sequences.
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the landscape.
- xstr, default=”d_jittered”
Column name in corr for the x-axis coordinates (jittered Hamming distance).
- ystr, default=”emp_cor”
Column name in corr for the y-axis coordinates and correlation values to plot.
- gpmap.plot.mpl.plot_interaction_matrix(matrix, axes, cmap='binary', vmax=None, scale_factor=1e-06, position_labels=None, xlabel='Site 1', ylabel='Site 2', cbar_label='Interaction strength ($a_{ij}$)')
Plots a heatmap of the estimated interaction strengths using local epistasis regression.
This plots the inverse of the regularization parameters for the local epistatic interactions involving every pair of sites.
- Parameters:
- matrixpd.DataFrame or np.ndarray
2D array or DataFrame containing the matrix values to visualize.
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the matrix.
- cmapstr, default=”binary”
Colormap name to use for the heatmap.
- vmaxfloat, optional
Maximum value for the colormap scale. If None, uses the maximum value in the matrix.
- scale_factorfloat, default=1e-6
Scaling factor to apply to matrix values before plotting. Useful for displaying very large or very small numbers.
- position_labelsarray-like, optional
Labels for the rows and columns. If None, uses the matrix index/columns.
- xlabelstr, default=”Site 1”
Label for the x-axis.
- ylabelstr, default=”Site 2”
Label for the y-axis.
- cbar_labelstr, default=”Interaction strength ($a_{ij}$)”
Label for the colorbar.
- gpmap.plot.mpl.plot_kth_variance_components(vc, axes, color='black', cum_color='grey', bar_ylim=(0, 50), cum_ylim=(0, 100))
Plot variance components for interaction orders.
- Parameters:
- vcpd.DataFrame
DataFrame containing variance components for a landscape.
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the variance components.
- colorstr, optional, default=”black”
Color for the bars representing variance percentages.
- cum_colorstr, optional, default=”grey”
Color for the cumulative variance line and points.
- bar_ylimtuple, optional, default=(0, 50)
Y-axis limits for the variance percentage bars.
- cum_ylimtuple, optional, default=(0, 100)
Y-axis limits for the cumulative variance line.
- gpmap.plot.mpl.plot_sites_variance_components(axes, sites, cmap='Greys', vmin=0, vmax=None, xlabel='Site', ylabel='Interaction order $k$', cbar_label='% variance explained')
Plot site-level variance components as a heatmap.
- Parameters:
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the heatmap.
- sitespd.DataFrame
DataFrame of shape (n_orders, n_sites) containing variance component values for each interaction order and site.
- cmapstr, default=”Greys”
Colormap name to use for the heatmap.
- vminfloat, default=0
Minimum value for the colormap scale.
- vmaxfloat, default=None
Maximum value for the colormap scale.
- xlabelstr, default=”Site”
Label for the x-axis (sites).
- ylabelstr, default=”Interaction order $k$”
Label for the y-axis (interaction orders).
- cbar_labelstr, default=”% variance explained”
Label for the colorbar.
- gpmap.plot.mpl.plot_site_pairs_variance_components(axes, matrix, cmap='Greys', vmin=0, vmax=60, xlabel='Site 1', ylabel='Site 2', cbar_label='% pairwise and higher-order\nvariance explained')
Plot pairwise site variance components as a heatmap.
- Parameters:
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the heatmap.
- matrixpd.DataFrame
DataFrame of shape (n_sites, n_sites) containing pairwise and higher-order variance component values.
- cmapstr, default=”Greys”
Colormap name to use for the heatmap.
- vminfloat, default=0
Minimum value for the colormap scale.
- vmaxfloat, default=60
Maximum value for the colormap scale.
- xlabelstr, default=”Site 1”
Label for the x-axis.
- ylabelstr, default=”Site 2”
Label for the y-axis.
- cbar_labelstr, default=”% pairwise and higher-order\nvariance explained”
Label for the colorbar.
Matplotlib Backend
- gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0, rasterized=False)
Plots the nodes representing the states of the discrete space on the provided coordinates.
- Parameters:
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the nodes or states.
- nodes_dfpandas.DataFrame
DataFrame of shape (n_genotypes, n_components + 2) containing the coordinates in each of the n_components, along with additional columns such as “function” and “stationary_freq”. Additional columns are also allowed.
- xstr, default=’1’
Column in nodes_df to use for plotting the genotypes on the x-axis.
- ystr, default=’2’
Column in nodes_df to use for plotting the genotypes on the y-axis.
- zstr, optional
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced, provided the axes object supports it.
- alphafloat, default=1
Transparency of markers representing the nodes.
- zorderint, default=2
Order in which the nodes will be rendered relative to other elements. Typically, this should be greater than the zorder used for plotting edges.
- colorstr, default=’function’
Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.
- vcenterbool, default=False
Whether to center the color scale around the 0 value.
- vmaxfloat, optional
Maximum value for the colormap.
- vminfloat, optional
Minimum value for the colormap.
- cmapstr or matplotlib.colors.Colormap, default=’viridis’
Colormap to use for coloring the nodes based on the color column.
- cbarbool, default=True
Whether to display the colorbar.
- cbar_labelstr, optional
Label for the colorbar associated with the nodes’ color scale.
- cbar_axesmatplotlib.axes.Axes, optional
Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.
- palettedict, optional
Dictionary mapping categories in the color column to specific colors, if the column represents categorical data.
- sizefloat or str, default=2.5
Size of the markers for the nodes. If a float is provided, it will be used for all nodes. If a string is provided, node sizes will be scaled based on the corresponding column in nodes_df.
- max_sizefloat, default=40
Maximum size for the nodes when scaled.
- min_sizefloat, default=1
Minimum size for the nodes when scaled.
- lwfloat, default=0
Line width of the edges around the markers representing the nodes.
- edgecolorstr, default=’black’
Color of the edges around the markers representing the nodes.
- legendbool, default=True
Whether to display a legend on the plot.
- legend_locint or tuple, default=0
Location of the legend if coloring is based on a categorical variable.
- rasterizedbool, default=False
Whether to rasterize the scatterplot when rendering the plot in vector format.
- gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None, rasterized=False)
Plots the edges representing connections between states in the discrete space under a particular embedding.
- Parameters:
- axesmatplotlib.axes.Axes
The matplotlib Axes object in which to plot the edges.
- nodes_dfpandas.DataFrame
DataFrame of shape (n_genotypes, n_components + 2) containing the coordinates in each of the n_components, along with additional columns such as “function” and “stationary_freq”. Additional columns are also allowed.
- edges_dfpandas.DataFrame
DataFrame of shape (n_edges, 2) containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indices of the pairs of states that are connected.
- xstr, default=’1’
Column in nodes_df to use for plotting the genotypes on the x-axis.
- ystr, default=’2’
Column in nodes_df to use for plotting the genotypes on the y-axis.
- zstr, optional
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced, provided the axes object supports it.
- alphafloat, default=0.1
Transparency of the lines representing the edges.
- zorderint, default=1
Order in which the edges will be rendered relative to other elements. Typically, this should be smaller than the zorder used for plotting nodes.
- colorstr, default=’grey’
Column name in edges_df for the values used to color the edges, or a specific color to use for all edges.
- cmapstr or matplotlib.colors.Colormap, default=’binary’
Colormap to use for coloring the edges based on the color column.
- widthfloat or str, default=0.5
Width of the lines representing the edges. If a float is provided, it will be used for all edges. If a string is provided, edge widths will be scaled based on the corresponding column in edges_df.
- max_widthfloat, default=1
Maximum width for the edges when scaled.
- min_widthfloat, default=0.1
Minimum width for the edges when scaled.
- rasterizedbool, optional, default=False
Whether to rasterize the plot for better performance with large datasets.
- Returns:
- line_collectionmatplotlib.collections.LineCollection or mpl_toolkits.mplot3d.art3d.Line3DCollection
The collection of lines representing the edges.
- gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None, rasterized=False)
Plots the nodes representing the states of the discrete space on the provided coordinates and the edges representing the connections between states if provided.
- Parameters:
- axesmatplotlib.axes.Axes
Matplotlib Axes object in which to plot the edges and nodes.
- nodes_dfpd.DataFrame of shape (n_genotypes, n_variables)
DataFrame containing the coordinates in each of the n_components, in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed.
- edges_dfpd.DataFrame of shape (n_edges, 2), optional
DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected. If not provided, only nodes will be plotted.
- xstr, optional, default=’1’
Column in nodes_df to use for plotting the genotypes on the x-axis.
- ystr, optional, default=’2’
Column in nodes_df to use for plotting the genotypes on the y-axis.
- zstr, optional, default=None
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, a 3D plot will be produced.
- nodes_alphafloat, optional, default=1
Transparency of the markers representing the nodes.
- nodes_zorderint, optional, default=2
Order in which the nodes will be rendered relative to other elements.
- nodes_colorstr, optional, default=’function’
Column name in nodes_df for the values used to color the nodes, or a specific color to use for all nodes.
- nodes_cmapstr, optional, default=’viridis’
Colormap to use for coloring the nodes based on the nodes_color column.
- nodes_palettedict, optional, default=None
Dictionary mapping categories in the nodes_color column to specific colors, if the column represents categorical data.
- nodes_vminfloat, optional, default=None
Minimum value for the colormap.
- nodes_vmaxfloat, optional, default=None
Maximum value for the colormap.
- nodes_vcenterbool, optional, default=False
Whether to center the color scale around the 0 value.
- nodes_cbarbool, optional, default=True
Whether to display the colorbar for the nodes.
- nodes_cbar_axesmatplotlib.axes.Axes, optional, default=None
Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes.
- nodes_cmap_labelstr, optional, default=’Function’
Label for the colorbar associated with the nodes’ color scale.
- nodes_sizefloat, optional, default=2.5
Size of the markers for the nodes.
- nodes_min_sizefloat, optional, default=1
Minimum size for the nodes when scaled.
- nodes_max_sizefloat, optional, default=40
Maximum size for the nodes when scaled.
- nodes_lwfloat, optional, default=0
Line width of the edges around the markers representing the nodes.
- nodes_edgecolorstr, optional, default=’black’
Color of the edges around the markers representing the nodes.
- edges_alphafloat, optional, default=0.1
Transparency of the lines representing the edges.
- edges_zorderint, optional, default=1
Order in which the edges will be rendered relative to other elements.
- edges_colorstr, optional, default=’grey’
Column name in edges_df for the values used to color the edges, or a specific color to use for all edges.
- edges_cmapstr, optional, default=’binary’
Colormap to use for coloring the edges based on the edges_color column.
- edges_palettedict, optional, default=None
Dictionary mapping categories in the edges_color column to specific colors, if the column represents categorical data.
- edges_cbarbool, optional, default=False
Whether to display the colorbar for the edges.
- edges_cbar_axesmatplotlib.axes.Axes, optional, default=None
Axes to plot the colorbar for the edges. If not provided, it will be automatically adjusted to the current Axes.
- edges_widthfloat, optional, default=0.5
Width of the lines representing the edges.
- edges_max_widthfloat, optional, default=1
Maximum width for the edges when scaled.
- edges_min_widthfloat, optional, default=0.1
Minimum width for the edges when scaled.
- sort_bystr, optional, default=None
Column in nodes_df to use for sorting the nodes before plotting.
- sort_ascendingbool, optional, default=True
Whether to sort the nodes in ascending order based on the sort_by column.
- center_spinesbool, optional, default=False
Whether to center the spines of the plot at (0, 0).
- add_histbool, optional, default=False
Whether to add a histogram inset showing the distribution of node colors.
- inset_cbarbool, optional, default=False
Whether to add an inset colorbar for the nodes.
- inset_postuple, optional, default=(0.7, 0.7)
Position of the inset colorbar or histogram as a fraction of the Axes.
- prev_nodes_dfpd.DataFrame, optional, default=None
DataFrame containing the previous positions of the nodes. If provided, the current nodes will be aligned to minimize their distance from the previous positions.
- rasterizedbool, optional, default=False
Whether to rasterize the plot for better performance with large datasets.
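The calls documented in this reference can be chained into a short plotting workflow. The sketch below is an assumption about typical usage and has not been checked against a gpmap installation; `plot_landscape` is a hypothetical wrapper, while WMWalk, calc_visualization, nodes_df, get_edges_df and plot_visualization are used with the signatures listed here.

```python
def plot_landscape(space, axes, mean_function_perc=98, n_components=5):
    """Embed a sequence space with a weak-mutation random walk and plot it."""
    # Imported inside the function so the sketch can be loaded without gpmap
    from gpmap.randwalk import WMWalk
    from gpmap.plot.mpl import plot_visualization

    rw = WMWalk(space)
    # Ns is derived from the requested percentile of the function values
    rw.calc_visualization(
        mean_function_perc=mean_function_perc, n_components=n_components
    )
    # Edges depend only on the space, not on the chosen Ns
    edges_df = space.get_edges_df()
    plot_visualization(axes, rw.nodes_df, edges_df=edges_df, x="1", y="2")
    return rw
```

Here space is expected to be a SequenceSpace and axes a matplotlib Axes object created by the caller, for example with matplotlib.pyplot.subplots.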
- gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})
Plot relaxation times for calculated components.
- Parameters:
- decay_dfpd.DataFrame
DataFrame with shape (n_components, 3) containing decay rates and associated mean relaxation times for each calculated component.
- axesmatplotlib.axes.Axes, optional
Axes object to plot on. If not provided, a new figure will be created and saved to the path specified by fpath.
- fpathstr, optional
File path to save the plot. If None, the axes argument must be provided for plotting.
- log_scalebool, default=False
Whether to plot relaxation times on a logarithmic scale.
- neutral_timefloat, optional
If provided, a horizontal line representing the neutral process relaxation time will be added to the plot. Useful for selecting relevant dimensions.
- kwargsdict, optional
Additional keyword arguments for axes.plot and axes.scatter, such as color or marker style.
- gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)
Generate a grid of visualizations for different stationary mean functions.
- Parameters:
- rwobject
An object containing the space and nodes information, as well as methods for calculating visualizations.
- xstr, optional
Column in the nodes DataFrame to use for plotting the x-axis. Default is “1”.
- ystr, optional
Column in the nodes DataFrame to use for plotting the y-axis. Default is “2”.
- pminfloat, optional
Minimum proportion of the range to use for calculating mean functions. Default is 0.
- pmaxfloat, optional
Maximum proportion of the range to use for calculating mean functions. Default is 0.8.
- ncolint, optional
Number of columns in the grid. Default is 4.
- nrowint, optional
Number of rows in the grid. Default is 3.
- show_edgesbool, optional
Whether to include edges in the visualization. Default is True.
- fpathstr, optional
File path to save the figure. If None, the figure will not be saved. Default is None.
- gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)
Generate a grid of visualizations for alleles at specific positions.
- Parameters:
- nodes_dfpd.DataFrame
DataFrame containing the nodes’ information, including their coordinates and attributes.
- edges_dfpd.DataFrame, optional
DataFrame containing the edges’ information, including connectivity between nodes. If None, edges will not be plotted. Default is None.
- allele_colorstr, optional
Color used to highlight nodes corresponding to specific alleles. Default is “orange”.
- background_colorstr, optional
Color used for the background nodes. Default is “lightgrey”.
- positionsarray-like, optional
List of positions to visualize. If None, all positions will be used. Default is None.
- position_labelsarray-like, optional
Labels for the positions. If None, positions will be labeled sequentially. Default is None.
- colsizeint, optional
Width of each column in the grid. Default is 3.
- rowsizefloat, optional
Height of each row in the grid. Default is 2.7.
- xpos_labelfloat, optional
Horizontal position of the allele label within each subplot, as a fraction of the axes width. Default is 0.05.
- ypos_labelfloat, optional
Vertical position of the allele label within each subplot, as a fraction of the axes height. Default is 0.92.
- fmtstr, optional
Format to save the figure, e.g., “png” or “pdf”. Default is “png”.
- fpathstr, optional
File path to save the figure. If None, the figure will not be saved. Default is None.
- gpmap.plot.mpl.figure_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)
Generates a 2-panel figure summarizing the SeqDEFT model results.
The first panel shows how the cross-validated likelihood changes with the a hyperparameter and highlights the selected value used for model fitting. If sequence density data is provided, the second panel visualizes the relationship between observed sequence frequencies and estimated densities.
- Parameters:
- log_Lspd.DataFrame
A DataFrame of shape (num_a, 3) containing the columns:
a: The hyperparameter values.
logL: The log-likelihood values.
fold: The cross-validation fold identifiers.
- seq_densitypd.DataFrame, optional
A DataFrame of shape (n_genotypes, >= 2) with the following columns:
frequency: Observed frequencies for each sequence.
Q: Estimated densities for each sequence.
If not provided, only a single-panel figure with the cross-validated likelihood curve will be generated.
- err_barsstr, default=’stderr’
Specifies the type of error bars to display:
'sd': Standard deviation across the different folds.
'stderr': Standard error of the mean.
- show_foldsbool, default=False
Whether to display the out-of-sample log-likelihoods for the individual folds in the cross-validation procedure.
- legend_locint, default=1
The location of the legend in the plot. Follows matplotlib’s legend location codes.
- normalize_logLbool, default=True
If True, normalizes the log-likelihood values relative to the value at a = ∞.
- Returns:
- figmatplotlib.figure.Figure
The resulting figure object containing the generated plots.
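The expected layout of log_Ls and the two error-bar options can be illustrated with plain pandas. This is a minimal sketch with made-up values, not output from a SeqDEFT fit; it assumes one row per combination of a and fold, and whether the library uses the sample standard deviation (ddof=1) internally is also an assumption.

```python
import pandas as pd

# Hypothetical cross-validation results: 2 values of a, 3 folds each.
log_Ls = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 10.0, 10.0, 10.0],
    "logL": [-12.0, -11.0, -13.0, -9.0, -10.0, -11.0],
    "fold": [0, 1, 2, 0, 1, 2],
})

# Summarize the out-of-sample log-likelihood across folds for each a.
# "std" mirrors the 'sd' option and "sem" mirrors the 'stderr' option.
summary = log_Ls.groupby("a")["logL"].agg(["mean", "std", "sem"])

# The value of a with the highest mean cross-validated log-likelihood.
best_a = summary["mean"].idxmax()
```

A table like this one is what would be passed as log_Ls; the summary itself is only shown to make the meaning of the error bars concrete.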
Plotly Backend
- gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)
Creates an interactive plot of a fitness landscape with genotypes as nodes and single point mutations as edges using Plotly.
- Parameters:
- nodes_dfpd.DataFrame
DataFrame containing genotype information. Must include columns for coordinates (e.g., “1”, “2”, “3”), “function”, and optionally other metadata.
- edges_dfpd.DataFrame, optional
DataFrame containing edge connectivity information. Must include columns “i” and “j” for connected node indices.
- xstr, default ‘1’
Column name in nodes_df for the x-axis coordinates.
- ystr, default ‘2’
Column name in nodes_df for the y-axis coordinates.
- zstr, optional
Column name in nodes_df for the z-axis coordinates. If provided, a 3D plot will be generated.
- nodes_colorstr, default ‘function’
Column name in nodes_df for node coloring or a specific color value.
- nodes_sizefloat, default 4
Size of the nodes. Can be a constant or a column name in nodes_df.
- nodes_cmapstr, default ‘viridis’
Colormap for node coloring.
- nodes_cmap_labelstr, default ‘Function’
Label for the colorbar associated with node coloring.
- edges_widthfloat, default 0.5
Width of the edges. Can be a constant or a column name in edges_df.
- edges_colorstr, default ‘#888’
Color of the edges.
- edges_alphafloat, default 0.2
Transparency of the edges.
- textarray-like, optional
Labels for nodes to display on hover. Defaults to nodes_df.index.
- fpathstr, optional
File path to save the interactive plot as an HTML file.
- Returns:
- figplotly.graph_objects.Figure
The generated Plotly figure.
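For readers preparing their own inputs, the relationship between nodes_df coordinates and the "i" and "j" columns of edges_df can be sketched as follows. A common Plotly idiom draws many line segments in one trace by separating them with None; whether gpmap builds its edge trace this way is an assumption, and all values below are made up.

```python
# Hypothetical node coordinates (columns "1" and "2" of nodes_df).
node_x = [0.0, 1.0, 1.0, 2.0]
node_y = [0.0, 1.0, -1.0, 0.0]

# Hypothetical edges as (i, j) pairs of node row positions.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

# Flatten the segments into coordinate lists; the None entries
# break the line so consecutive edges are not joined together.
edge_x, edge_y = [], []
for i, j in edges:
    edge_x += [node_x[i], node_x[j], None]
    edge_y += [node_y[i], node_y[j], None]
```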
Datashader Backend
- gpmap.plot.ds.plot_nodes(nodes_df, x='1', y='2', color='function', cmap='viridis', vmin=None, vmax=None, size=5, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=True, shade=True, resolution=800, square=False)
Plot nodes with various customization options.
- Parameters:
- nodes_dfpandas.DataFrame
DataFrame containing node data.
- xstr, optional
Column name for the x-axis, by default “1”.
- ystr, optional
Column name for the y-axis, by default “2”.
- colorstr, optional
Column name for the color values, by default “function”.
- cmapstr, optional
Colormap to use for coloring nodes, by default “viridis”.
- vminfloat, optional
Minimum value for color scaling, by default None.
- vmaxfloat, optional
Maximum value for color scaling, by default None.
- sizeint, optional
Size of the nodes, by default 5.
- linewidthint, optional
Line width of the node edges, by default 0.
- edgecolorstr, optional
Color of the node edges, by default “black”.
- sort_bystr, optional
Column name to sort nodes by, by default None.
- sort_ascendingbool, optional
Whether to sort nodes in ascending order, by default True.
- shadebool, optional
Whether to use datashader for rendering, by default True.
- resolutionint, optional
Resolution of the plot, by default 800.
- squarebool, optional
Whether to enforce a square aspect ratio, by default False.
- Returns:
- holoviews.Element
A Holoviews element representing the plotted nodes.
- gpmap.plot.ds.plot_edges(nodes_df, edges_df, x='1', y='2', cmap='grey', width=0.5, alpha=0.2, color='grey', shade=True, resolution=800, square=True)
Plot edges.
- Parameters:
- nodes_dfpandas.DataFrame
DataFrame containing node data.
- edges_dfpandas.DataFrame
DataFrame containing edge data.
- xstr, optional
Column name for the x-axis, by default “1”.
- ystr, optional
Column name for the y-axis, by default “2”.
- cmapstr, optional
Colormap to use for coloring edges, by default “grey”.
- widthfloat, optional
Line width of the edges, by default 0.5.
- alphafloat, optional
Transparency level of the edges, by default 0.2.
- colorstr, optional
Color of the edges, by default “grey”.
- shadebool, optional
Whether to use datashader for rendering, by default True.
- resolutionint, optional
Resolution of the plot, by default 800.
- squarebool, optional
Whether to enforce a square aspect ratio, by default True.
- Returns:
- holoviews.Element
A Holoviews element representing the plotted edges.
- gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)
Plots the nodes representing the states of the discrete space on the provided coordinates and the edges representing the connections between states if provided.
- Parameters:
- nodes_dfpandas.DataFrame
DataFrame containing node data.
- xstr, optional
Column name for the x-axis, by default “1”.
- ystr, optional
Column name for the y-axis, by default “2”.
- edges_dfpandas.DataFrame, optional
DataFrame containing edge data, by default None.
- nodes_colorstr, optional
Column name for the color values, by default “function”.
- nodes_cmapstr, optional
Colormap to use for coloring nodes, by default “viridis”.
- nodes_sizeint, optional
Size of the nodes, by default 5.
- nodes_vminfloat, optional
Minimum value for color scaling, by default None.
- nodes_vmaxfloat, optional
Maximum value for color scaling, by default None.
- linewidthint, optional
Line width of the node edges, by default 0.
- edgecolorstr, optional
Color of the node edges, by default “black”.
- sort_bystr, optional
Column name to sort nodes by, by default None.
- sort_ascendingbool, optional
Whether to sort nodes in ascending order, by default False.
- edges_widthfloat, optional
Line width of the edges, by default 0.5.
- edges_alphafloat, optional
Transparency level of the edges, by default 1.
- edges_colorstr, optional
Color of the edges, by default “grey”.
- edges_cmapstr, optional
Colormap to use for coloring edges, by default “grey”.
- background_colorstr, optional
Background color of the plot, by default “white”.
- nodes_resolutionint, optional
Resolution of the nodes plot, by default 800.
- edges_resolutionint, optional
Resolution of the edges plot, by default 1200.
- shade_nodesbool, optional
Whether to use datashader for rendering nodes, by default True.
- shade_edgesbool, optional
Whether to use datashader for rendering edges, by default True.
- squarebool, optional
Whether to enforce a square aspect ratio, by default True.
- Returns:
- holoviews.Element
A Holoviews element representing the plotted visualization.
- gpmap.plot.ds.dsg_to_fig(dsg)
Convert a Holoviews element to a Matplotlib figure.
- Parameters:
- dsgholoviews.Element
A Holoviews element to be converted.
- Returns:
- matplotlib.figure.Figure
A Matplotlib figure object representing the Holoviews element.
- gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)
Generate a grid of allele visualizations and save the resulting figure.
- Parameters:
- nodes_dfpandas.DataFrame
DataFrame containing node data.
- fpathstr
File path to save the resulting figure.
- xstr, optional
Column name for the x-axis, by default “1”.
- ystr, optional
Column name for the y-axis, by default “2”.
- edges_dfpandas.DataFrame, optional
DataFrame containing edge data, by default None.
- positionslist or numpy.ndarray, optional
List or array of positions to visualize, by default None.
- position_labelslist or numpy.ndarray, optional
Labels for the positions, by default None.
- edges_cmapstr, optional
Colormap to use for coloring edges, by default “grey”.
- background_colorstr, optional
Background color of the plot, by default “white”.
- nodes_resolutionint, optional
Resolution of the nodes plot, by default 800.
- edges_resolutionint, optional
Resolution of the edges plot, by default 1200.
- sort_bystr, optional
Column name to sort nodes by, by default None.
- sort_ascendingbool, optional
Whether to sort nodes in ascending order, by default False.
- fmtstr, optional
Format to save the figure, by default “png”.
- figsizetuple, optional
Size of the figure in inches, by default None.
- squarebool, optional
Whether to enforce a square aspect ratio, by default True.
Datasets
- gpmap.datasets.list_available_datasets()
Retrieve the names of all available built-in datasets.
This function scans the directory specified by LANDSCAPES_DIR and extracts the names of all files present, excluding their extensions. It returns these names as a list.
- Returns:
- list
A list of strings, where each string is the name of a built-in dataset.
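The directory scan described above can be sketched with the standard library. The helper name and the file names are hypothetical, and a temporary directory stands in for LANDSCAPES_DIR; this is not the library's implementation.

```python
import os
import tempfile


def list_names_without_extension(directory):
    """Hypothetical helper: file names in a directory, extensions removed."""
    return sorted(os.path.splitext(fname)[0] for fname in os.listdir(directory))


# A temporary directory plays the role of LANDSCAPES_DIR.
with tempfile.TemporaryDirectory() as landscapes_dir:
    for fname in ["example_a.csv", "example_b.parquet"]:
        open(os.path.join(landscapes_dir, fname), "w").close()
    names = list_names_without_extension(landscapes_dir)
```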
- class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)
DataSet object for managing and manipulating various components related to a specific dataset. This includes the original data, reconstructed landscape, and visualization coordinates.
- Parameters:
- dataset_namestr
The name of the dataset to load from the built-in list. If data or landscape are provided, this will be the name assigned to the new dataset.
- datapd.DataFrame, shape (n_obs, n_features), optional
A DataFrame containing the experimental data with genotypes as the index.
- landscapepd.DataFrame, shape (n_genotypes, 1), optional
A DataFrame containing the complete combinatorial landscape used to build the remaining components of the dataset.
- Attributes:
- data
- edges
- landscape
- nodes
- relaxation_times
Methods
calc_visualization([Ns, mean_function, ...])Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk.
plot()Makes a two-panel figure with the relaxation times associated with the computed Diffusion axes and a low-dimensional representation of the complete genotype-phenotype map from this DataSet.
save([fdir])Saves the dataset to disk for direct access within the library.
to_sequence_space()Generate a SequenceSpace object from the dataset's landscape.
- calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=20)
Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, re-scaled by the corresponding quantity so that the embedding is in units of square root of time.
- Parameters:
- Nsfloat, optional
Scaled effective population size to use in the underlying evolutionary model. If not provided, it will be derived from mean_function or mean_function_perc.
- mean_functionfloat, optional
Mean function at stationarity to derive the associated Ns. Either this or mean_function_perc must be provided if Ns is not specified.
- mean_function_percfloat, optional
Percentile that the mean function at stationarity takes within the distribution of function values along sequence space. For example, if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values. Either this or mean_function must be provided if Ns is not specified.
- n_componentsint, default=20
Number of eigenvectors or Diffusion axes to calculate.
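The meaning of mean_function_perc can be illustrated with NumPy: the target mean function at stationarity is the requested percentile of the function values across sequence space. This sketch uses made-up values and shows only the percentile step; how the library then derives Ns from that target is not shown and not assumed here.

```python
import numpy as np

# Hypothetical function values for 101 genotypes: 0, 1, ..., 100.
function_values = np.arange(101, dtype=float)

# With mean_function_perc=98, the target mean function is the
# 98th percentile of the function values.
mean_function_perc = 98
target_mean_function = np.percentile(function_values, mean_function_perc)
```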
- plot()
Makes a two-panel figure with the relaxation times associated with the computed Diffusion axes and a low-dimensional representation of the complete genotype-phenotype map from this DataSet.
- save(fdir=None)
Saves the dataset to disk for direct access within the library.
This method stores raw data, inferred genotype-phenotype map, and the computed visualization coordinates and relaxation times when available.
- Parameters:
- fdirstr, optional
Directory where the dataset should be saved. If not provided, the default directories defined in the settings will be used.
Notes
The dataset is stored by default in the installation folder and will be deleted upon re-installation.
If a custom directory is provided, the dataset will be saved with specific suffixes for each component.
- to_sequence_space()
Generate a SequenceSpace object from the dataset’s landscape.
This method constructs a SequenceSpace object using the genotypes and their corresponding values from the dataset’s landscape.
- Returns:
- SequenceSpace
A SequenceSpace object representing the dataset’s landscape.
Utilities
Input/Output
- gpmap.utils.read_dataframe(fpath)
Read a DataFrame from a file path in different formats.
- Parameters:
- fpathstr
Path to the file containing the DataFrame. The file extension determines the format: ‘csv’ or ‘parquet’.
- Returns:
- pd.DataFrame
The DataFrame read from the file.
- Raises:
- ValueError
If the file format is not recognized.
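Extension-based dispatch of this kind can be sketched with pandas. The helper name is hypothetical, and reading the first CSV column as the index is an assumption; the library's own reader may differ in both respects.

```python
import os
import tempfile

import pandas as pd


def read_table_by_extension(fpath):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    ext = os.path.splitext(fpath)[1].lstrip(".").lower()
    if ext == "csv":
        # Assumption: the first column holds the genotype labels.
        return pd.read_csv(fpath, index_col=0)
    if ext == "parquet":
        # Requires a parquet engine such as pyarrow or fastparquet.
        return pd.read_parquet(fpath)
    raise ValueError(f"Unrecognized file format: {ext!r}")


# Round-trip a tiny table through a temporary CSV file.
with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "landscape.csv")
    pd.DataFrame({"function": [1.0, 2.0]}, index=["AA", "AB"]).to_csv(csv_path)
    loaded = read_table_by_extension(csv_path)

# An unsupported extension raises ValueError before any file access.
try:
    read_table_by_extension("landscape.txt")
    raised = False
except ValueError:
    raised = True
```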
- gpmap.utils.read_edges(fpath, log=None, return_df=True)
Reads the incidence matrix containing the adjacency information among genotypes from a sequence space.
- Parameters:
- fpathstr
File path containing the edges of a sequence space. The extension will be used to differentiate between csv, parquet, and the more efficient npz format.
- logLogTrack, optional
Logger instance to log messages. Default is None.
- return_dfbool, default=True
Whether to return a pandas DataFrame with the edges. If False, it will return a csr_matrix.
- Returns:
- edges_dfpd.DataFrame or csr_matrix
If return_df is True, returns a DataFrame with columns i and j containing the indices of the genotypes that are separated by a single mutation in a sequence space. If return_df is False, returns a csr_matrix representation of the edges.
Genotype dataframes
- gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)
Select the specified genotypes from nodes_df, along with the corresponding edges among the remaining genotypes if edges are provided.
- Parameters:
- nodes_dfpd.DataFrame
A DataFrame containing genotypes as the index and various features as columns. Typically, it includes at least the coordinates for visualization, but it may also retain other metadata.
- genotypesarray-like
An array of genotypes to select from the input landscape. By default, it should contain genotype labels, or indexes if the is_idx option is set to True.
- edgespd.DataFrame or scipy.sparse.csr_matrix, optional
A DataFrame or csr_matrix representing the adjacency relationships among genotypes provided in nodes_df within the discrete space. Defaults to None.
- is_idxbool, optional
Indicates whether the genotypes argument is an array of indexes instead of an array of genotype labels. Defaults to False.
- Returns:
- pd.DataFrame or tuple
If edges is None, returns a DataFrame containing the filtered genotypes. Otherwise, returns a tuple containing the filtered DataFrame and the adjacency relationships between the selected genotypes.
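The selection logic can be sketched in pandas for the DataFrame form of edges. The sketch assumes that the "i" and "j" columns hold row positions in nodes_df and that the returned edges are re-indexed to positions in the filtered DataFrame; both are assumptions about the library's behavior, and the data is made up.

```python
import pandas as pd

nodes_df = pd.DataFrame(
    {"1": [0.0, 1.0, 1.0, 2.0], "function": [0.1, 0.5, 0.4, 0.9]},
    index=["AA", "AB", "BA", "BB"],
)
# Edges refer to row positions in nodes_df.
edges_df = pd.DataFrame({"i": [0, 0, 1, 2], "j": [1, 2, 3, 3]})

genotypes = ["AA", "AB", "BB"]
selected_nodes = nodes_df.loc[genotypes]

# Map old row positions to new ones; dropped genotypes are absent.
old_positions = nodes_df.index.get_indexer(genotypes)
new_position = {int(old): new for new, old in enumerate(old_positions)}

# Keep only edges whose two endpoints were both selected, then re-index.
kept_positions = list(new_position)
keep = edges_df["i"].isin(kept_positions) & edges_df["j"].isin(kept_positions)
selected_edges = pd.DataFrame({
    "i": edges_df.loc[keep, "i"].map(new_position),
    "j": edges_df.loc[keep, "j"].map(new_position),
}).reset_index(drop=True)
```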
- gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})
Filter and return the genotype labels that satisfy the specified conditions based on maximum and minimum values for the columns in the input DataFrame.
- Parameters:
- nodes_dfpd.DataFrame
DataFrame with genotypes as the index and various features as columns. Typically, it contains at least the coordinates for visualization, but it may also include other metadata.
- max_valuesdict, optional
Dictionary where keys are column names and values are the maximum thresholds for filtering genotypes. Genotypes with values greater than these thresholds in the specified columns will be excluded.
- min_valuesdict, optional
Dictionary where keys are column names and values are the minimum thresholds for filtering genotypes. Genotypes with values less than these thresholds in the specified columns will be excluded.
- Returns:
- pd.Index
Index containing the labels of genotypes that meet the specified filtering criteria.
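The filtering rule can be written out in pandas as a minimal sketch with made-up data. Thresholds are treated as inclusive, following the parameter descriptions above (only values strictly greater than a maximum or strictly less than a minimum are excluded); this is an illustration of the rule, not the library's code.

```python
import pandas as pd

nodes_df = pd.DataFrame(
    {"1": [-1.0, 0.5, 1.5, 2.5], "function": [0.2, 0.8, 0.9, 0.3]},
    index=["AA", "AB", "BA", "BB"],
)
max_values = {"1": 2.0}
min_values = {"1": 0.0, "function": 0.5}

# Start by keeping every genotype, then apply each threshold in turn.
mask = pd.Series(True, index=nodes_df.index)
for column, threshold in max_values.items():
    mask &= nodes_df[column] <= threshold
for column, threshold in min_values.items():
    mask &= nodes_df[column] >= threshold

region_genotypes = nodes_df.index[mask.to_numpy()]
```

Here "AA" falls below the minimum on column "1" and "BB" exceeds the maximum, so only "AB" and "BA" remain.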
- gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)
Marginalize specific positions in the sequences and compute the average of numeric values across the remaining genetic backgrounds.
- Parameters:
- nodes_dfpd.DataFrame
A DataFrame with sequence names as the index and at least one numeric column to compute the average across the selected genetic backgrounds.
- keep_posarray-like, optional
A list of 0-indexed positions to retain. The sequences will be averaged across all genetic backgrounds specified by the positions not included in this list. If not provided, skip_pos must be specified.
- skip_posarray-like, optional
A list of 0-indexed positions to marginalize out. The sequences will be averaged across these positions. If not provided, keep_pos must be specified.
- return_edgesbool, optional, default=False
If True, returns an additional DataFrame containing the edges of the reduced sequence space for visualization purposes.
- Returns:
- nodes_dfpd.DataFrame
A DataFrame containing the average value of every numeric column in the input DataFrame, with the subsequences at the desired positions as the index.
- edges_dfpd.DataFrame, optional
A DataFrame containing the edges of the reduced sequence space. This is only returned if return_edges=True.
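Averaging over genetic backgrounds amounts to grouping genotypes by their subsequence at the retained positions. The following pandas sketch uses made-up values and covers only the keep_pos case; it illustrates the idea and is not the library's implementation.

```python
import pandas as pd

nodes_df = pd.DataFrame(
    {"function": [1.0, 2.0, 3.0, 5.0]}, index=["AA", "AB", "BA", "BB"]
)
keep_pos = [0]  # retain the first site, average over the second

# Subsequence at the retained positions for every genotype.
subseqs = pd.Series(
    ["".join(seq[p] for p in keep_pos) for seq in nodes_df.index],
    index=nodes_df.index,
    name="subsequence",
)

# Average every numeric column within each subsequence group.
marginal = nodes_df.select_dtypes("number").groupby(subseqs).mean()
```

Allele "A" at the first site averages 1.0 and 2.0, and allele "B" averages 3.0 and 5.0.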
Sequence handling
- gpmap.seq.get_custom_codon_table(aa_mapping)
Constructs a Biopython CodonTable for translation using a custom genetic code.
- Parameters:
- aa_mappingpd.DataFrame
A pandas DataFrame with columns “Codon” and “Letter” representing the genetic code mapping. Stop codons should be denoted with “*”.
- Returns:
- codon_tableBio.Data.CodonTable.CodonTable
A Biopython CodonTable object that can be used for translating sequences with the specified custom genetic code.
- gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')
Generate a mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are grouped into a single allele.
- Parameters:
- seqsarray-like of shape (n_genotypes,) or (n_obs,)
Observed sequences. If counts is None, each sequence is assumed to appear once. Otherwise, frequencies are calculated using the counts as the number of times a sequence appears in the data.
- n_allelesint or array-like of shape (seq_length,)
Maximum number of alleles allowed per site. If an array is provided, each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles.
- countsNone or array-like of shape (n_genotypes,)
Number of times each sequence in seqs appears in the data. If not provided, each sequence is assumed to appear exactly once.
- keep_allele_namesbool, optional
If True, allele names are preserved. Otherwise, they are replaced by new alleles taken from the alphabet. Default is True.
- last_characterstr, optional
Character to use for pooled alleles when keep_allele_names is True. Default is “X”.
- Returns:
- codelist of dict of length seq_length
A list of dictionaries, where each dictionary maps the original alleles to the new reduced alphabet for each site.
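The shape of the returned code can be illustrated with a standard-library sketch. The function below is hypothetical and handles only an integer n_alleles with allele names preserved; it also assumes that the pooled group counts as one of the n_alleles, which may not match the library's rule.

```python
from collections import Counter


def reduce_site_alleles(seqs, n_alleles, counts=None, last_character="X"):
    """Hypothetical sketch: per-site allele maps with pooled rare alleles."""
    if counts is None:
        counts = [1] * len(seqs)
    code = []
    for site in range(len(seqs[0])):
        # Weighted allele frequencies at this site.
        freqs = Counter()
        for seq, count in zip(seqs, counts):
            freqs[seq[site]] += count
        ranked = [allele for allele, _ in freqs.most_common()]
        if len(ranked) <= n_alleles:
            kept, pooled = ranked, []
        else:
            # Assumption: the pooled group counts as one of the n_alleles.
            kept, pooled = ranked[: n_alleles - 1], ranked[n_alleles - 1:]
        site_code = {allele: allele for allele in kept}
        site_code.update({allele: last_character for allele in pooled})
        code.append(site_code)
    return code


seqs = ["AC", "AC", "AC", "TG", "CG"]
code = reduce_site_alleles(seqs, n_alleles=2)
```

In this example the first site has three alleles, so the two rare ones are pooled into "X", while the second site already fits within two alleles and is left unchanged.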
- gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)
Extracts unique sequences and their counts from a Multiple Sequence Alignment (MSA). Optionally, subsequences can be selected based on specific positions, and sequence identity re-weighting can be applied to account for sequence similarities across the full alignment.
- Parameters:
- Xarray-like of aligned sequences
Input sequences from which to extract unique sequences and counts.
- yarray-like of weights, optional (default=None)
Pre-calculated weights associated with the input sequences. If not provided, weights are calculated based on sequence identity.
- positionsarray-like of int, optional (default=None)
Subset of positions to extract subsequences from the MSA. If not provided, the full sequences are used.
- phylo_correctionbool, optional (default=False)
If True, applies sequence identity re-weighting. Observations are weighted as 1 divided by the number of similar sequences in the MSA. Similar sequences are defined based on the max_dist parameter.
- max_distfloat, optional (default=0.2)
Maximum sequence identity distance for considering sequences as similar during re-weighting. Only used if phylo_correction is True.
- Returns:
- Xnp.array of shape (n_unique_seqs,)
Unique subsequences at the specified positions in the MSA.
- ynp.array of shape (n_unique_seqs,)
Counts or re-weighted counts for each unique subsequence in the MSA.
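The counting and re-weighting steps can be sketched with the standard library. The function names are hypothetical; the sketch assumes that distance is the fraction of mismatched positions over the full alignment, that the max_dist threshold is inclusive, and that each sequence counts itself as similar. None of these details are confirmed by the description above.

```python
from collections import defaultdict


def hamming_fraction(s1, s2):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def count_subsequences(msa, positions=None, phylo_correction=False,
                       max_dist=0.2):
    """Hypothetical sketch: unique subsequences and their (weighted) counts."""
    if phylo_correction:
        # Weight = 1 / number of sequences within max_dist (self included).
        weights = [
            1.0 / sum(hamming_fraction(s, t) <= max_dist for t in msa)
            for s in msa
        ]
    else:
        weights = [1.0] * len(msa)
    totals = defaultdict(float)
    for seq, weight in zip(msa, weights):
        sub = seq if positions is None else "".join(seq[p] for p in positions)
        totals[sub] += weight
    unique = sorted(totals)
    return unique, [totals[s] for s in unique]


msa = ["AAAAA", "AAAAT", "CCCCC"]
X, y = count_subsequences(msa, positions=[0, 1])
Xw, yw = count_subsequences(msa, positions=[0, 1], phylo_correction=True)
```

Without re-weighting, the subsequence "AA" is counted twice. With re-weighting, the two near-identical sequences each contribute 0.5, so "AA" and "CC" end up with equal weight.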