API Reference

Discrete spaces

class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)

Class to define an arbitrary discrete space characterized uniquely by the connectivity between its states and, optionally, by a function (e.g. fitness or energy) defined at each state of the discrete space

Parameters:
adjacency_matrix: scipy.sparse.csr_matrix of shape (n_states, n_states)

Sparse matrix representing the adjacency relationships between states. The ij’th entry contains a 1 if the states i and j are connected and 0 otherwise

y: array-like of shape (n_states,)

Quantitative property associated to each state

state_labels: array-like of shape (n_genotypes, )

State labels in the sequence space

Attributes:
n_states: int

Number of states in the discrete space

state_labels: array-like of shape (n_genotypes, )

State labels in the sequence space

state_idxs: pd.Series of shape (n_genotypes, )

pd.Series containing the index of each state. It has state_labels as index of the Series and can be used to quickly extract the index corresponding to a set of state labels

is_regular: bool

Boolean variable storing whether the resulting graph is regular or not, that is, whether each node has the same number of neighbors
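
To make the adjacency representation and the is_regular attribute concrete, the following pure-Python sketch stores a small undirected graph as a dense 0/1 matrix, checks whether every state has the same number of neighbors, and lists the connected index pairs. It does not use gpmap: the helper names `is_regular` and `neighbor_pairs` are hypothetical, and a real DiscreteSpace expects a scipy.sparse.csr_matrix rather than nested lists.

```python
# Adjacency matrix of a 4-state cycle: 0-1, 1-2, 2-3, 3-0
adjacency = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]


def is_regular(matrix):
    """Return True if every row has the same number of non-zero entries."""
    degrees = {sum(1 for value in row if value != 0) for row in matrix}
    return len(degrees) == 1


def neighbor_pairs(matrix):
    """Return two parallel lists with the indexes of connected states."""
    rows, cols = [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
    return rows, cols


rows, cols = neighbor_pairs(adjacency)
print(is_regular(adjacency))
print(list(zip(rows, cols)))
```

Each undirected edge appears twice in the output, once per direction, because the matrix is symmetric.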

Methods

get_neighbor_pairs()

Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace

get_neighbors(states[, max_distance])

Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance

get_state_idxs(states)

Returns the indexes for the provided state labels

get_edges_df

get_neighbor_pairs()

Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace

get_neighbors(states, max_distance=1)

Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance

Parameters:
states: array-like of shape (state_number,)

np.array or list of states from which to select the neighbors

max_distance: int (1)

The maximal distance at which neighbors from the provided states will be returned

Returns:
neighbor_states: np.array

Array containing the state labels in the d-neighborhood of `states`
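
The d-neighborhood described here can be sketched as a breadth-first search that stops expanding once max_distance is reached. This is an illustrative pure-Python version, not gpmap's implementation: the graph is given as a hypothetical dictionary of neighbor sets, and the query states themselves are included in the result (distance 0), which is an assumption about the intended behavior.

```python
from collections import deque


def get_neighbors(adjacency, states, max_distance=1):
    """Return the sorted labels of all states within max_distance of `states`.

    `adjacency` maps each state label to the set of labels it is connected to.
    """
    distance = {s: 0 for s in states}
    queue = deque(states)
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return sorted(distance)


# Path graph A - B - C - D
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(get_neighbors(adjacency, ["A"], max_distance=1))  # ['A', 'B']
print(get_neighbors(adjacency, ["A"], max_distance=2))  # ['A', 'B', 'C']
```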

get_state_idxs(states)

Returns the indexes for the provided state labels

class gpmap.space.ProductSpace(elementary_graphs, y=None, state_labels=None)

General class for spaces that can be built as cartesian products of smaller subspaces characterized by a set of elementary graphs

Parameters:
elementary_graphs: csr_matrices

List of csr_matrices with the adjacency matrices from which to build the product space

y: None or array-like of shape (n,)

np.array containing the phenotypic values associated to each combination of states in the resulting space. If y=None, no phenotypic values will be stored

state_labels: None or list

List with the labels associated to each of the possible states a in each of the l elements of the product space. If state_labels=None, numeric labels will be given by default.
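
As a rough illustration of what a Cartesian product of elementary graphs means, the sketch below builds the product states and edges in pure Python. Two product states are connected when they differ in exactly one coordinate and the two values at that coordinate are adjacent in the corresponding elementary graph. The function name and the list-of-sets input format are hypothetical; ProductSpace itself takes csr_matrices.

```python
from itertools import product


def cartesian_product_graph(graphs):
    """Build the Cartesian product of graphs given as lists of neighbor sets.

    Each elementary graph is a list where entry i is the set of neighbors of
    node i. Product states are tuples with one node per elementary graph.
    """
    states = list(product(*[range(len(g)) for g in graphs]))
    edges = set()
    for state in states:
        for dim, graph in enumerate(graphs):
            for neighbor in graph[state[dim]]:
                # Move along exactly one elementary graph at a time
                other = state[:dim] + (neighbor,) + state[dim + 1:]
                edges.add((state, other))
    return states, edges


# Two elementary graphs: a single edge (2 nodes) and a triangle (3 nodes)
edge = [{1}, {0}]
triangle = [{1, 2}, {0, 2}, {0, 1}]
states, edges = cartesian_product_graph([edge, triangle])
print(len(states))  # 6 product states
print(len(edges))   # 18 directed pairs, i.e. 9 undirected edges
```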

Attributes:
is_regular

Attribute characterizing whether the space is regular, that is, whether every node has the same number of neighbors

n_edges

Methods

get_neighbor_pairs()

Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace

get_neighbors(states[, max_distance])

Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance

get_state_idxs(states)

Returns the indexes for the provided state labels

calc_adjacency_matrix

calc_states

format_list_ends

format_values

get_edges_df

get_y

init_space

set_dim_sizes

set_y

write_csv

write_edges

class gpmap.space.GridSpace(length, y=None, ndim=2)

Class for creating an N-dimensional grid discrete space

Parameters:
length: int or array-like

Number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If a series of lengths is provided, they will be used to form a grid of dimensions with the specified lengths and the ndim argument will be ignored

ndim: int

Number of dimensions of the grid. It is only used when length is a single integer.

y: array-like of shape (length ** ndim,) or None

Phenotypic values associated to each possible state

Methods

set_peaks

class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)

Class for creating a Sequence space characterized by having sequences as states. States are connected in the discrete space if they differ by a single position in the sequence. It can be created in two different ways:

  • From a set of sequences and function values X, y

  • By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site and type of alphabet).
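
The connectivity rule stated above (two sequences are neighbors if they differ at a single position) can be sketched without gpmap as follows. The function names are hypothetical and the quadratic neighbor search is only suitable for tiny spaces; it is meant to show the definition, not how SequenceSpace builds its sparse adjacency matrix.

```python
from itertools import product


def hamming_distance(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def build_sequence_space(alphabet):
    """Enumerate all sequences and connect those differing at a single site.

    `alphabet` is a list with one list of alleles per site.
    """
    genotypes = ["".join(alleles) for alleles in product(*alphabet)]
    neighbors = {
        g: [h for h in genotypes if hamming_distance(g, h) == 1]
        for g in genotypes
    }
    return genotypes, neighbors


# DNA sequence space of length 2: 4 ** 2 = 16 genotypes
genotypes, neighbors = build_sequence_space([list("ACGT")] * 2)
print(len(genotypes))
print(neighbors["AA"])
```

In this example every genotype has 6 neighbors: 3 alternative alleles at each of the 2 sites.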

Parameters:
X: array-like of shape (n_genotypes,)

Sequences to use as state labels of the discrete sequence space

y: array-like of shape (n_genotypes,)

Quantitative phenotype or fitness associated to each genotype

seq_length: int (None)

Length of the sequences in the sequence space. If not given, it will be guessed from alphabet or n_alleles

n_alleles: list of size `seq_length` (None)

List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type=custom

alphabet_type: str (‘dna’)

Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}

alphabet: list of `seq_length` lists

Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.

stop_y: float (None)

Function value assigned to protein sequences with an in-frame stop codon. If given, the protein alphabet is extended with * to represent stops

Attributes:
n_genotypes: int

Number of states in the complete sequence space

genotypes: array-like of shape (n_genotypes, )

Genotype labels in the sequence space

adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)

Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise

y: array-like of shape (n_genotypes,)

Quantitative phenotype or fitness associated to each genotype

is_regular: bool

Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles

Methods

get_single_mutant_matrix(sequence[, center])

Returns the effects of single point mutations from a focal sequence

remove_codon_incompatible_transitions([...])

Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table

to_nucleotide_space([codon_table, alphabet_type])

Transforms a protein space into a nucleotide space using a codon table for translating the sequence

get_single_mutant_matrix(sequence, center=False)

Returns the effects of single point mutations from a focal sequence

Parameters:
sequence: str

String encoding the sequence from which to report all single point mutant effects

center: bool (False)

If True, results will be centered by position, so that the mean of allelic effects is 0. If False, the focal sequence will have 0 and values would represent mutational effects from it

Returns:
output: pd.DataFrame of shape (seq_length, total_alleles)

pd.DataFrame containing the mutational or allelic effects for each allele across all sequence positions
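
The difference between the two settings of center can be sketched in plain Python. The function below is hypothetical and simplified: it assumes the same alleles at every site, takes the function values as a dictionary, and returns nested lists (one row per position, one column per allele) instead of a pd.DataFrame.

```python
def single_mutant_effects(sequence, y, alphabet, center=False):
    """Return one row per site with the value associated to each allele.

    `y` maps every sequence to its function value and `alphabet` lists the
    alleles shared by all sites (a simplifying assumption).
    """
    rows = []
    for pos in range(len(sequence)):
        values = []
        for allele in alphabet:
            mutant = sequence[:pos] + allele + sequence[pos + 1:]
            values.append(y[mutant])
        if center:
            # Allelic effects: subtract the mean across alleles at this site
            offset = sum(values) / len(values)
        else:
            # Mutational effects: subtract the value of the focal sequence
            offset = y[sequence]
        rows.append([v - offset for v in values])
    return rows


y = {"AA": 1.0, "AB": 2.0, "BA": 4.0, "BB": 3.0}
effects = single_mutant_effects("AA", y, alphabet="AB")
centered = single_mutant_effects("AA", y, alphabet="AB", center=True)
print(effects)   # [[0.0, 3.0], [0.0, 1.0]]
print(centered)  # [[-1.5, 1.5], [-0.5, 0.5]]
```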

remove_codon_incompatible_transitions(codon_table='Standard')

Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table

Parameters:
codon_table: str or Bio.Data.CodonTable

NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein
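
One way to picture codon-compatible transitions is to list the amino acid pairs that can be interconverted by a single nucleotide substitution under a given genetic code. The sketch below does this for the standard genetic code using only the standard library; `compatible_transitions` is a hypothetical helper and does not reproduce how gpmap updates its adjacency matrix.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def compatible_transitions(codon_table):
    """Return the set of amino acid pairs linked by one nucleotide change."""
    pairs = set()
    for codon, aa in codon_table.items():
        for pos in range(3):
            for base in BASES:
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = codon_table[mutant]
                if new_aa != aa:
                    pairs.add((aa, new_aa))
    return pairs


allowed = compatible_transitions(CODON_TABLE)
print(("M", "I") in allowed)  # True: ATG -> ATA
print(("W", "F") in allowed)  # False: needs at least two nucleotide changes
```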

to_nucleotide_space(codon_table='Standard', alphabet_type='dna')

Transforms a protein space into a nucleotide space using a codon table for translating the sequence

Parameters:
codon_table: str or Bio.Data.CodonTable

NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein

alphabet_type: str (‘dna’)

Sequence type to use in the resulting nucleotide space. It can only take one of the following values {‘dna’, ‘rna’}

Returns:
SequenceSpace

Nucleotide sequence space with 4 alleles per site and 3 times the number of sites of the current space

class gpmap.space.HammingBallSpace(X0, X=None, y=None, d=None, n_alleles=None, alphabet_type='dna', alphabet=None)

Class for the space representing the Hamming ball around a target sequence up to a certain number of mutations from it.

Parameters:
X0: str

Focal sequence around which to build the Hamming ball space

X: array-like of shape (n_genotypes,)

Sequences to use as state labels of the discrete sequence space

y: array-like of shape (n_genotypes,)

Quantitative phenotype or fitness associated to each genotype

d: int (None)

Maximum distance from the focal sequence to include in the space

n_alleles: list of size `seq_length` (None)

List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type=custom

alphabet_type: str (‘dna’)

Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}

alphabet: list of `seq_length` lists

Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.

Attributes:
n_genotypes: int

Number of states in the complete sequence space

genotypes: array-like of shape (n_genotypes, )

Genotype labels in the sequence space

adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)

Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise

y: array-like of shape (n_genotypes,)

Quantitative phenotype or fitness associated to each genotype

is_regular: bool

Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles

Methods

get_neighbor_pairs()

Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace

get_neighbors(states[, max_distance])

Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance

get_state_idxs(states)

Returns the indexes for the provided state labels

calc_adjacency_matrix

calc_graph

calc_max_min_path

calc_n_paths

format_list_ends

format_values

get_edges_df

get_genotypes

get_y

init_space

set_alphabet_type

set_seq_length

set_y

write_csv

write_edges

Random walks

class gpmap.randwalk.WMWalk(space, log=None, Ns=None)

Class for Weak Mutation Weak Selection Random Walk on a SequenceSpace. It is a time-reversible continuous time Markov Chain where the transition rates depend on the differences in fitness between two states, scaled by the effective population size Ns.

Attributes:
space: DiscreteSpace class

Space on which the random walk takes place

Ns: float

Scaled effective population size for the evolutionary model

rate_matrix: csr_matrix

Rate matrix defining the continuous time process

Methods

set_Ns():

Method to specify the scaled effective population size Ns, either directly or by specifying the mean function at stationarity or the percentile it represents from the distribution of functions across sequence space

calc_stationary_frequencies():

Calculates the stationary frequencies of the states under the random walk specified on the discrete space

calc_rate_matrix():

Calculates the rate matrix for the continuous time process given the scaled effective population size (Ns) or average phenotype at stationarity.

calc_neutral_mixing_rates(site_exchange_rates, neutral_site_freqs)

Calculates the neutral mixing rates for a SequenceSpace. If no GTR mutation model is specified, the neutral mixing rate is limited by the site with the fewest alleles. Otherwise, because mutations are assumed to be site-independent, the neutral mixing rate is limited by the slowest site, given by the smallest of the second eigenvalues of the site rate matrices

Parameters:
neutral_site_Qs: list of array-like of shape (n_alleles, n_alleles)

List containing site-specific rate matrices to use for calculating the limiting mixing in the neutral case. If not provided, uniform mutation rates are assumed.

neutral_site_freqs: list of array-like of shape (n_alleles,)

List containing vectors with the stationary frequencies under neutrality for each site. They are used to calculate the eigenvalues of the time reversible site specific neutral chain. By default, they are assumed to be uniform across sites and alleles.

site_weights: array-like of shape (seq_length,)

Vector containing the relative weight associated to each site. This value is used to scale the individually normalized rates matrices to ensure this specific leaving rate. By default, all weights are equal

Returns:
neutral_mixing_rate: float

Neutral mixing rate as the smallest second largest eigenvalue across sites.

TODO: Re-implement functionality

calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)

Calculates the rate matrix for the random walk in the discrete space and stores it in the attribute rate_matrix

Parameters:
Ns: real

Scaled effective population size for the evolutionary model

neutral_stat_freqs: array-like of shape (n_states,)

Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics

neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states)

Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
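
A minimal sketch of such a rate matrix is shown below. It assumes uniform neutral dynamics and a commonly used weak-mutation rate of the form S / (1 - exp(-S)), with S = Ns * (f_j - f_i); this documentation does not state the exact expression, so the formula, the dense list-of-lists output and the neighbor dictionary are assumptions made for illustration only.

```python
import math


def transition_rate(f_i, f_j, Ns):
    """Rate i -> j as S / (1 - exp(-S)), with S = Ns * (f_j - f_i)."""
    S = Ns * (f_j - f_i)
    if abs(S) < 1e-12:
        return 1.0  # neutral limit of the expression above
    return S / (1.0 - math.exp(-S))


def calc_rate_matrix(neighbors, y, Ns):
    """Dense rate matrix with rows summing to zero."""
    n = len(y)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            Q[i][j] = transition_rate(y[i], y[j], Ns)
        # The diagonal is still 0 here, so this sums the off-diagonal rates
        Q[i][i] = -sum(Q[i])
    return Q


# Three states on a path 0 - 1 - 2 with increasing fitness
neighbors = {0: [1], 1: [0, 2], 2: [1]}
y = [0.0, 1.0, 2.0]
Ns = 2.0
Q = calc_rate_matrix(neighbors, y, Ns)

# Under these assumptions the stationary frequencies are proportional to exp(Ns * y)
weights = [math.exp(Ns * f) for f in y]
pi = [w / sum(weights) for w in weights]
```

With this choice of rates the chain satisfies detailed balance, pi[i] * Q[i][j] == pi[j] * Q[j][i], which is what makes it time-reversible.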

calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, re-scaled by the corresponding quantity so that the embedding is in units of square root of time

Parameters:
Ns: float

Scaled effective population size to use in the underlying evolutionary model

mean_function: float

Mean function at stationarity to derive the associated Ns

mean_function_perc: float

Percentile that the mean function at stationarity takes within the distribution of function values along sequence space e.g. if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values

n_components: int (10)

Number of eigenvectors or diffusion axes to calculate

neutral_stat_freqs: array-like of shape (n_states,)

Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics

neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states)

Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
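
The relationship between Ns, mean_function and mean_function_perc can be illustrated with a small numerical sketch. It assumes that stationary frequencies are proportional to exp(Ns * y), which holds only under uniform neutral frequencies, and finds Ns by bisection because the stationary mean increases with Ns. The helper names and the interpolating percentile rule are hypothetical and are not taken from gpmap.

```python
import math


def stationary_mean(y, Ns):
    """Mean function value when frequencies are proportional to exp(Ns * y)."""
    y_max = max(y)
    weights = [math.exp(Ns * (f - y_max)) for f in y]  # shifted for stability
    return sum(w * f for w, f in zip(weights, y)) / sum(weights)


def percentile(values, perc):
    """Percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * perc / 100.0
    low = int(math.floor(rank))
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_Ns(y, mean_function, tol=1e-10, max_Ns=1e6):
    """Bisection on Ns >= 0; the stationary mean increases with Ns."""
    low, high = 0.0, 1.0
    while stationary_mean(y, high) < mean_function:
        high *= 2.0
        if high > max_Ns:
            raise ValueError("mean_function is not reachable")
    while high - low > tol:
        mid = (low + high) / 2.0
        if stationary_mean(y, mid) < mean_function:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


y = [0.0, 0.5, 1.0, 1.5, 2.0]
target = percentile(y, 80)  # plays the role of mean_function_perc=80
Ns = find_Ns(y, target)
print(target, Ns, stationary_mean(y, Ns))
```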

write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')

Writes the output of the visualization in tables with a common prefix. The output consists of 2 to 3 different tables, as the edges table does not need to be stored every time

  • nodes coordinates: contains the coordinates for each state and the associated function values and stationary frequencies. It is stored in CSV format with suffix “nodes.csv” or in parquet format with suffix “nodes.pq”

  • decay rates: contains the decay rates and relaxation times associated to each component or diffusion axis. It is stored in CSV format with suffix “decay_rates.csv”

  • edges: contains the adjacency relationships between states. It is not stored unless write_edges=True, as it remains unchanged for any visualization on the same SequenceSpace and therefore only needs to be stored once. It can be stored in CSV format, or in the more efficient npz format for sparse matrices

Parameters:
prefix: str

Prefix of the files to store the different tables

write_edges: bool (False)

Option to also write the information about the adjacency relationships between pairs of states for plotting the edges

nodes_format: str {‘parquet’, ‘csv’}

Format to store the nodes information. parquet is more efficient but CSV can be used in smaller cases for plain text storage.

edges_format: str {‘npz’, ‘csv’}

Format to store the edges information. npz is more efficient but CSV can be used in smaller cases for plain text storage.

Landscape Inference

class gpmap.inference.MinimumEpistasisInterpolator(P=2, n_alleles=None, seq_length=None, alphabet_type='custom', cg_rtol=1e-16)

Methods

predict()

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

smooth

predict()

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

Returns:
function: pd.DataFrame of shape (n_genotypes, 1)

Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype

class gpmap.inference.MinimumEpistasisRegression(P, a=None, n_alleles=None, seq_length=None, alphabet_type='custom', nfolds=5, num_reg=20, min_log_reg=-2, max_log_reg=6, progress=True, cg_rtol=0.0001)

Methods

fit(X, y[, y_var, cross_validation])

Infers the optimal a from the provided data, that is, the magnitude of Pth order local epistatic coefficients that maximizes predictive performance in held out data

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

fit(X, y, y_var=None, cross_validation=False)

Infers the optimal a from the provided data, that is, the magnitude of Pth order local epistatic coefficients that maximizes predictive performance in held out data

Parameters:
X: array-like of shape (n_obs,)

Vector containing the genotypes for which we have observations provided by y

y: array-like of shape (n_obs,)

Vector containing the observed phenotypes corresponding to X sequences

y_var: array-like of shape (n_obs,)

Vector containing the empirical or experimental known variance for the measurements in y

Returns:
a: float

Optimal a value maximizing the cross-validated log-likelihood

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:
contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)

DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:
contrasts: pd.DataFrame of shape (n_contrasts, 5)

DataFrame containing the summary of the posterior for each contrast: the posterior standard deviation, the lower and upper bounds of the 95 % credible interval and the posterior probability of each quantity being larger or smaller than 0.
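
For a Gaussian posterior, the summary of a linear combination of genotypes follows from standard results: the contrast c has mean c·m and variance cᵀΣc, where m and Σ are the posterior mean and covariance. The sketch below computes these quantities in plain Python. The function name and the dictionary keys are hypothetical and are not the column names returned by make_contrasts, and it assumes the contrast has non-zero posterior variance.

```python
import math


def summarize_contrast(contrast, mean, cov):
    """Posterior summary of sum_i contrast[i] * f[i] for Gaussian f."""
    n = len(contrast)
    estimate = sum(c * m for c, m in zip(contrast, mean))
    variance = sum(
        contrast[i] * cov[i][j] * contrast[j] for i in range(n) for j in range(n)
    )
    std = math.sqrt(variance)
    # P(contrast > 0) from the normal cumulative distribution function
    p_positive = 0.5 * (1.0 + math.erf(estimate / (std * math.sqrt(2.0))))
    return {
        "estimate": estimate,
        "std": std,
        "ci_95_lower": estimate - 1.96 * std,
        "ci_95_upper": estimate + 1.96 * std,
        "p_positive": p_positive,
    }


# Difference between two genotypes with correlated posterior uncertainty
posterior_mean = [1.0, 3.0]
posterior_cov = [[1.0, 0.5], [0.5, 1.0]]
summary = summarize_contrast([-1.0, 1.0], posterior_mean, posterior_cov)
print(summary)
```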

class gpmap.inference.VCregression(lambdas=None, n_alleles=None, seq_length=None, alphabet_type='custom', beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=1e-16, progress=True)

Variance Component regression model that allows inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior parametrized by the contribution of the different orders of interaction to the observed genetic variability of a continuous phenotype

It requires the same number of alleles at every site

Methods

fit(X, y[, y_var])

Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

predict([X_pred, calc_variance])

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

simulate([sigma, p_missing])

Simulates data under the specified Variance component priors

lambdas_to_variance

fit(X, y, y_var=None)

Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships

Stores learned lambdas in the attribute VCregression.lambdas to use internally for predictions and returns them as output

Parameters:
X: array-like of shape (n_obs,)

Vector containing the genotypes for which we have observations provided by y

y: array-like of shape (n_obs,)

Vector containing the observed phenotypes corresponding to X sequences

y_var: array-like of shape (n_obs,)

Vector containing the empirical or experimental known variance for the measurements in y

Returns:
lambdas: array-like of shape (seq_length + 1,)

Variances for each order of interaction k inferred from the data

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:
contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)

DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:
contrasts: pd.DataFrame of shape (n_contrasts, 5)

DataFrame containing the summary of the posterior for each contrast: the posterior standard deviation, the lower and upper bounds of the 95 % credible interval and the posterior probability of each quantity being larger or smaller than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

Parameters:
X_pred: array-like of shape (n_genotypes,)

Vector containing the genotypes for which we want to predict the phenotype. If X_pred=None, predictions are provided for the whole sequence space

calc_variance: bool (False)

Option to also return the posterior variances for each individual genotype

Returns:
function: pd.DataFrame of shape (n_genotypes, 1)

Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype

simulate(sigma=0, p_missing=0)

Simulates data under the specified Variance component priors

Parameters:
sigma: real

Standard deviation of the experimental noise additional to the variance components

p_missing: float between 0 and 1

Probability of randomly missing genotypes in the simulated output data

Returns:
data: pd.DataFrame of shape (n_genotypes, 3)

DataFrame with the columns y_true, y and var corresponding to the true function at each genotype, the observed values and the variance of the measurement respectively for each sequence or genotype indicated in the DataFrame.index

class gpmap.inference.SeqDEFT(P, n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)

Sequence Density Estimation using Field Theory model that allows inference of a complete sequence probability distribution under a Gaussian Process prior parameterized by variance of local epistatic coefficients of order P

It requires the same number of alleles at every site

Parameters:
P: int

Order of the local interaction coefficients that are penalized under the prior, i.e. P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph while P=3 penalizes local 3-way interactions across all possible cubes.

a: float (None)

Parameter related to the inverse of the variance of the P-order epistatic coefficients that are being penalized. Larger values induce stronger penalization and approximation to the Maximum-Entropy model of order P-1. If a=None the best a is found through cross-validation

num_reg: int (20)

Number of a values to evaluate through cross-validation

nfolds: int (5)

Number of folds to use in the cross-validation procedure

Methods

fit(X[, y, baseline_phi, baseline_X, ...])

Infers the sequence-function relationship under the specified Delta^{(P)} prior

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

simulate(N[, phi, seed])

Simulates data under the specified a penalization for local P-epistatic coefficients

simulate_phi()

Simulates data under the specified a penalization for local P-epistatic coefficients

fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)

Infers the sequence-function relationship under the specified Delta^{(P)} prior

Parameters:
X: array-like of shape (n_obs,)

Vector containing the observed sequences

y: array-like of shape (n_obs,)

Vector containing the weights for each observed sequence. By default, each sequence takes a weight of 1. These weights can be calculated using phylogenetic correction

baseline_X: array-like of shape (n_genotypes,)

Vector containing the sequences associated with baseline_phi

baseline_phi: array-like of shape (n_genotypes,)

Vector containing the baseline_phi to include in the model

positions: array-like of shape (n_pos,)

If provided, subsequences at these positions in the provided input sequences will be used as input

phylo_correction: bool (False)

Apply phylogenetic correction using the full length sequences

adjust_freqs: bool (False)

Whether to correct densities by the expected allele frequencies in the full length sequences

allele_freqs: dict or codon_table

Dictionary containing the expected frequencies for every allele in the set of possible sequences, or the codon table to use to generate expected amino acid frequencies. If None, they will be calculated from the full length observed sequences.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:
contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)

DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:
contrasts: pd.DataFrame of shape (n_contrasts, 5)

DataFrame containing the summary of the posterior for each contrast: the posterior standard deviation, the lower and upper bounds of the 95 % credible interval and the posterior probability of each quantity being larger or smaller than 0.

simulate(N, phi=None, seed=None)

Simulates data under the specified a penalization for local P-epistatic coefficients

Parameters:
N: int

Number of total sequences to sample

phi: array-like of shape (n_genotypes,)

Vector containing values for the field underlying the probability distribution from which to sample sequences. If provided, they will be used instead of sampling them from the prior characterized by the given a.

seed: int (None)

Random seed to use for simulation

Returns:
X: array-like of shape (N,)

Vector containing the sampled sequences from the probability distribution
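
Sampling sequences from a given field can be sketched as follows. The example assumes the usual density estimation convention in which the probability of a genotype is proportional to exp(-phi), so lower field values mean more probable sequences. That sign convention, the function names and the explicit genotypes argument are assumptions, not a description of the SeqDEFT internals.

```python
import math
import random


def phi_to_probabilities(phi):
    """Convert a field phi into probabilities proportional to exp(-phi)."""
    phi_min = min(phi)
    weights = [math.exp(-(p - phi_min)) for p in phi]  # shifted for stability
    total = sum(weights)
    return [w / total for w in weights]


def simulate(genotypes, phi, N, seed=None):
    """Sample N sequences with replacement according to the field phi."""
    rng = random.Random(seed)
    probabilities = phi_to_probabilities(phi)
    return rng.choices(genotypes, weights=probabilities, k=N)


genotypes = ["AA", "AB", "BA", "BB"]
phi = [0.0, 1.0, 1.0, 3.0]
probabilities = phi_to_probabilities(phi)
X = simulate(genotypes, phi, N=1000, seed=0)
print(probabilities)
print(X[:5])
```

Passing the same seed reproduces the same sample, which mirrors the purpose of the seed argument described above.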

simulate_phi()

Simulates data under the specified a penalization for local P-epistatic coefficients

Returns:
phi: array-like of shape (n_genotypes,)

Vector containing values for the latent phenotype or field sampled from the prior characterized by a

Sequence utils

gpmap.seq.guess_space_configuration(seqs, ensure_full_space=True, force_regular=False, force_regular_alleles=False)

Guesses the sequence space configuration from a collection of sequences. This allows different numbers of alleles per site and maintains the order in which alleles appear in the sequences when enumerating the alleles per position

Parameters:
seqs: array-like of shape (n_genotypes,)

Vector or list containing the sequences from which we want to infer the space configuration

ensure_full_space: bool

Option to ensure that the whole sequence space must be represented by the set of provided sequences. This is a useful feature to identify whether there are missing genotypes before defining the space and random walk to visualize the full landscape.

force_regular: bool

Option to ensure that there are the same number of alleles per site. New allele names will be added to sites with less than the maximum number of alleles across sites

force_regular_alleles: bool

Option to additionally ensure that the same alleles are common across all sites

Returns:
config: dict with keys {‘length’, ‘n_alleles’, ‘alphabet’}

Returns a dictionary with the inferred configuration of the discrete space where the sequences come from.
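
The kind of inference described here can be sketched in a few lines of plain Python. This simplified version only covers the ensure_full_space option, raises a ValueError when genotypes are missing (an assumed behavior), and keeps alleles in order of first appearance. It is an illustration of the idea, not the gpmap implementation.

```python
def guess_space_configuration(seqs, ensure_full_space=True):
    """Infer length, alleles per site and alphabet from observed sequences."""
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("All sequences must have the same length")
    alphabet = [[] for _ in range(length)]
    for seq in seqs:
        for site, allele in enumerate(seq):
            if allele not in alphabet[site]:
                alphabet[site].append(allele)  # keep order of first appearance
    n_alleles = [len(alleles) for alleles in alphabet]
    if ensure_full_space:
        expected = 1
        for n in n_alleles:
            expected *= n
        if len(set(seqs)) != expected:
            raise ValueError("Some genotypes are missing from the space")
    return {"length": length, "n_alleles": n_alleles, "alphabet": alphabet}


config = guess_space_configuration(["AG", "AC", "TG", "TC"])
print(config)
```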

gpmap.seq.get_custom_codon_table(aa_mapping)

Builds a Biopython CodonTable to use for translation with a custom genetic code

Parameters:
aa_mapping: pd.DataFrame

pandas DataFrame with columns “Codon” and “Letter” representing the genetic code correspondence. Stop codons should appear as “*”

Returns:
codon_table: Bio.Data.CodonTable.CodonTable object

Standard Biopython codon table object to use for translating sequences

gpmap.seq.get_one_hot_from_alleles(alphabet)

Returns a one hot encoding CSR matrix for a complete combinatorial space. It uses a fast recursive method to avoid repetition of building common blocks in the full matrix

Parameters:
alphabet: list of list

List containing lists of alleles per site in a sequence space

Returns:
scipy.sparse.csr_matrix of shape (n_genotypes, total_n_alleles)

csr matrix containing the one hot encoding of the full sequence space, with genotypes sorted lexicographically
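
To show the layout of such an encoding, the sketch below builds a dense one-hot matrix for a small space with different numbers of alleles per site. It uses direct enumeration with itertools.product rather than the recursive block construction mentioned above, and it returns nested lists instead of a csr_matrix; the function name is hypothetical.

```python
from itertools import product


def one_hot_from_alleles(alphabet):
    """Dense one-hot rows for every genotype, in itertools.product order."""
    # Column offset of the first allele of each site
    offsets = [0]
    for alleles in alphabet[:-1]:
        offsets.append(offsets[-1] + len(alleles))
    total_alleles = sum(len(alleles) for alleles in alphabet)

    rows = []
    for genotype in product(*[range(len(a)) for a in alphabet]):
        row = [0] * total_alleles
        for offset, allele_idx in zip(offsets, genotype):
            row[offset + allele_idx] = 1
        rows.append(row)
    return rows


# First site with 2 alleles, second site with 3 alleles: 6 genotypes, 5 columns
encoding = one_hot_from_alleles([["A", "B"], ["A", "B", "C"]])
for row in encoding:
    print(row)
```

Every row contains exactly one 1 per site, so row sums equal the sequence length.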

gpmap.seq.get_alphabet(n_alleles=None, alphabet_type=None)

Returns the resulting alphabet from specifying either the type or the number of alleles per site

Parameters:
n_alleles: int

Number of alleles per site

alphabet_type: str

Type of alphabet to use out of {None, ‘dna’, ‘rna’, ‘protein’}

Returns:
alphabet: list

List containing the alleles in the desired alphabet

gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')

Returns a list of dictionaries with the mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are pooled together into a single allele

Parameters:
seqs: array-like of shape (n_genotypes,) or (n_obs,)

Observed sequences. If counts=None, then every sequence is counted once. Otherwise, frequencies are calculated using the counts as the number of times a certain sequence appears in the data

n_alleles: int or array-like of shape (seq_length, )

Maximal number of alleles per site allowed. If a list or array is provided each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles

counts: None or array-like of shape (n_genotypes, )

Number of times every sequence in seqs appears in the data. If not provided, every provided sequence is assumed to appear exactly once

keep_allele_names: bool

If keep_allele_names=True, then allele names are preserved. Otherwise they are replaced by new alleles taken from the alphabet

last_character: str

Character to use for remaining alleles when keep_allele_names=True

Returns:
code: list of dict of length seq_length

List of dictionaries containing the new allele corresponding to each of the original alleles for each site.
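The pooling of rare alleles described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the library's code: it only covers the keep_allele_names=True case with a single integer n_alleles, it assumes the pooled allele counts towards the n_alleles limit, and it breaks frequency ties alphabetically. The name `freq_reduced_code` is hypothetical.

```python
from collections import Counter


def freq_reduced_code(seqs, n_alleles, counts=None, last_character="X"):
    """Per-site dicts mapping each observed allele to a reduced alphabet."""
    if counts is None:
        counts = [1] * len(seqs)
    code = []
    for site in range(len(seqs[0])):
        # Weighted allele frequencies at this site
        freqs = Counter()
        for seq, c in zip(seqs, counts):
            freqs[seq[site]] += c
        # Most frequent first; ties broken alphabetically (assumption)
        ranked = sorted(freqs, key=lambda a: (-freqs[a], a))
        if len(ranked) <= n_alleles:
            kept = set(ranked)
        else:
            # Reserve one slot for the pooled allele
            kept = set(ranked[:n_alleles - 1])
        code.append({a: (a if a in kept else last_character) for a in ranked})
    return code


seqs = ["AC", "AG", "TG", "CG", "AT"]
code = freq_reduced_code(seqs, n_alleles=2)
print(code)
```

With n_alleles=2, only the most frequent allele keeps its name at each site and all others map to "X".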

gpmap.seq.transcribe_seqs(seqs, code)
gpmap.seq.translate_seqs(seqs, codon_table='Standard')
gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)

Obtains a series of sequences and their counts from a Multiple Sequence Alignment (MSA) provided as a list of sequences. It can select subsequences by selecting which positions to look at in the MSA and do sequence identity re-weighting by considering the sequence similarities across the full length sequence

Parameters:
X: array-like of aligned sequences

Input sequences from which to extract counts

y: array-like of weights (None)

Pre-calculated weights associated to the input sequences

positions: array-like of positions (None)

If provided, subsequences at this subset of positions will be used to provide counts or re-weighted counts

phylo_correction: bool (False)

If True, observations will be re-weighted using sequence similarity along the whole sequence as 1 over the number of similar sequences in the MSA. Similar sequences are defined as those that differ from each other by less than the specified `max_dist`

max_dist: float (0.2)

Pairs of sequences that differ by less than this value will be considered similar for re-weighting

Returns:
X: np.array of shape (n_unique_seqs, )

Unique subsequences at the specified positions in the MSA

y: np.array of shape (n_unique_seqs, )

Counts or re-weighted counts for each of the unique subsequences in the MSA
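The re-weighting scheme described above can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes that the distance is the fraction of differing positions over the full-length sequences, that positions are 0-indexed, and that unique subsequences are returned sorted. The function names are hypothetical.

```python
from collections import defaultdict


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def msa_to_weighted_counts(X, positions=None, phylo_correction=False,
                           max_dist=0.2):
    if phylo_correction:
        # Weight = 1 / number of similar sequences. Every sequence is
        # similar to itself (distance 0), so the denominator is >= 1
        weights = [
            1.0 / sum(hamming_fraction(s, t) < max_dist for t in X)
            for s in X
        ]
    else:
        weights = [1.0] * len(X)

    # Add up the weights of all sequences sharing the same subsequence
    totals = defaultdict(float)
    for seq, w in zip(X, weights):
        sub = seq if positions is None else "".join(seq[p] for p in positions)
        totals[sub] += w
    unique = sorted(totals)
    return unique, [totals[s] for s in unique]


msa = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
print(msa_to_weighted_counts(msa, positions=[0, 1]))
print(msa_to_weighted_counts(msa, positions=[0, 1], phylo_correction=True))
```

The first two sequences differ at 1 of 10 positions, so with the correction each of them contributes half a count to the subsequence "AA".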

Genotypes handling

gpmap.utils.read_dataframe(fpath)
gpmap.utils.read_edges(fpath, log=None, return_df=True)

Reads the incidence matrix containing the adjacency information among genotypes from a sequence space

Parameters:
fpath: str

File path containing the edges of a sequence space. The extension will be used to differentiate between csv and the more efficient npz format

return_df: bool (True)

Whether to return a pd.DataFrame with the edges. Alternatively it will return a csr_matrix

Returns:
edges_df: pd.DataFrame of shape (n_edges, 2) or csr_matrix

DataFrame with column names i and j containing the indexes of the genotypes that are separated by a single mutation in a sequence space

gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)

Selects the provided genotypes from nodes_df with the corresponding edges among the remaining genotypes if edges are provided

Parameters:
nodes_df: pd.DataFrame of shape (n_genotypes, n_features)

DataFrame with the genotypes from a full sequence space as index. Typically, it will contain, at least, the coordinates of the visualization for each genotype, but it will keep any other column in the DataFrame for later use

genotypes: array-like of shape (n_genotypes,)

Array of ordered genotypes to select from the starting landscape. It should contain the genotype labels by default, or indexes if option is_idx is provided

edges: pd.DataFrame of shape (n_edges, 2) or scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)

DataFrame or csr_matrix containing the adjacency relationships among genotypes provided in nodes_df in the discrete space

is_idx: bool

The genotypes argument is an array of indexes instead of an array of genotype labels to select genotypes

Returns:
output: (nodes_df, edges)

Filtered landscape containing the selected genotypes and the adjacency relationships between them given as a tuple
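To make the filtering concrete, here is a minimal sketch that uses plain lists instead of a DataFrame. It assumes that edges are given as pairs of row indexes and that they are re-indexed to the filtered node list; whether the library re-indexes in the same way is not stated above. The name `select_genotypes_sketch` is hypothetical.

```python
def select_genotypes_sketch(labels, genotypes, edges=None):
    """Keep the requested labels and the edges joining two kept labels.

    labels: genotype labels of the full space, in row order.
    genotypes: labels to keep.
    edges: optional list of (i, j) index pairs into labels.
    """
    wanted = set(genotypes)
    kept = [idx for idx, label in enumerate(labels) if label in wanted]
    # Map old row indexes to positions in the filtered node list
    new_idx = {old: new for new, old in enumerate(kept)}
    nodes = [labels[i] for i in kept]
    if edges is None:
        return nodes, None
    # An edge survives only if both of its endpoints were selected
    sub_edges = [(new_idx[i], new_idx[j]) for i, j in edges
                 if i in new_idx and j in new_idx]
    return nodes, sub_edges


labels = ["AA", "AB", "BA", "BB"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
nodes, sub_edges = select_genotypes_sketch(labels, ["AA", "AB", "BB"], edges)
print(nodes, sub_edges)
```

Dropping "BA" removes both edges that touch it, and the index of "BB" shifts from 3 to 2.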

gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})

Returns the genotype labels matching the specified conditions as maximum and minimum values of the dataframe

Parameters:
nodes_df: pd.DataFrame of shape (n_genotypes, n_features)

DataFrame with the genotypes from a full sequence space as index. Typically, it will contain, at least, the coordinates of the visualization for each genotype, but it will keep any other column in the DataFrame for later use

max_values: dict

Dictionary with column names as keys and max values to filter genotypes as values

min_values: dict

Dictionary with column names as keys and min values to filter genotypes as values

Returns:
genotypes: array-like of shape (n_selected,)

Array containing the selected genotypes from the input dataframe
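The selection logic can be illustrated with a small sketch over a dictionary of rows. It assumes that the bounds are inclusive and that a genotype must satisfy every condition; neither detail is stated above. The name `genotypes_in_region` is hypothetical.

```python
def genotypes_in_region(rows, max_values=None, min_values=None):
    """rows maps each genotype label to a {column: value} dictionary."""
    max_values = max_values or {}
    min_values = min_values or {}
    selected = []
    for genotype, values in rows.items():
        below = all(values[col] <= bound for col, bound in max_values.items())
        above = all(values[col] >= bound for col, bound in min_values.items())
        if below and above:
            selected.append(genotype)
    return selected


rows = {
    "AA": {"1": -1.0, "2": 0.2},
    "AB": {"1": 0.5, "2": 0.1},
    "BB": {"1": 1.5, "2": -0.3},
}
region = genotypes_in_region(rows, max_values={"1": 1.0}, min_values={"2": 0.0})
print(region)
```

The sketch uses None as the default and replaces it with an empty dictionary inside the function, which avoids sharing one mutable default object between calls.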

gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)

Averages out some positions in the sequences for all numeric values provided in the input dataframe

Parameters:
nodes_df: pd.DataFrame

DataFrame with sequence names as index and at least one numeric column to calculate the average across the selected backgrounds

keep_pos: array-like (None)

If provided, list of 0-index positions that are to be preserved and averaged across all genetic backgrounds specified by the remaining positions

skip_pos: array-like (None)

If provided, list of 0-index positions to average out

return_edges: bool (False)

Return also an edges_df DataFrame to use directly for visualization

Returns:
nodes_df: pd.DataFrame

DataFrame containing the average value of every numeric column in the input DataFrame with the subsequences at the desired positions as index

edges_df: pd.DataFrame

DataFrame containing the edges of the reduced sequence space. It will only be provided if `return_edges=True`
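Averaging out positions amounts to grouping sequences by the alleles at the kept positions and taking the mean within each group. The following minimal sketch shows this for a single numeric value per sequence; it uses a plain dictionary rather than a DataFrame, and the name `marginalize_positions` is hypothetical.

```python
from collections import defaultdict


def marginalize_positions(values, keep_pos):
    """Average values over all backgrounds at positions not in keep_pos.

    values maps each full sequence to a number; keep_pos lists the
    0-indexed positions to preserve.
    """
    groups = defaultdict(list)
    for seq, value in values.items():
        sub = "".join(seq[p] for p in keep_pos)
        groups[sub].append(value)
    return {sub: sum(v) / len(v) for sub, v in groups.items()}


values = {"AA": 1.0, "AB": 3.0, "BA": 2.0, "BB": 6.0}
marginal = marginalize_positions(values, keep_pos=[0])
print(marginal)
```

Passing skip_pos instead would be equivalent to keeping every position that is not listed in it.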

Plotting

gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})

Plots the relaxation times associated with each of the components calculated using `WMWalk.calc_visualization`

Parameters:
decay_df: pd.DataFrame of shape (n_components, 3)

pd.DataFrame containing the decay rates and the associated mean relaxation times for each of the calculated components

axes: matplotlib Axes object (None)

Axes where to plot. If not provided, a new figure will be created automatically for this plot and saved in the path provided by fpath

fpath: str (None)

File path to store the plot. If fpath=None, axes argument must be provided for plotting.

log_scale: bool (False)

Plot the relaxation times in log scale

neutral_time: float (None)

If provided, an additional horizontal line will be plotted representing the relaxation time associated to the neutral process. This is useful when selecting the number of relevant dimensions to plot

kwargs: dict

Additional key-word arguments dictionary provided for axes.plot and axes.scatter e.g. color.

gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None)

Plots the edges representing the connections between states that are connected in the discrete space under a particular embedding

Parameters:
axes: matplotlib Axes

Axes in which to plot the edges.

nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)

pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed

edges_df: pd.DataFrame of shape (n_edges, 2)

pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.

x: str (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

y: str (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

z: str (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.

alpha: float (0.1)

Transparency of lines representing the edges

zorder: int (1)

Order in which the edges will be rendered relative to other elements. Generally, we would want this to be smaller than the zorder used for plotting the nodes

color: str (‘grey’)

Column name for the values according to which edges will be colored or the specific color to use for plotting the edges

cmap: colormap or str

Colormap to use for coloring the edges according to column color

width: float or str

Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.

max_width: float (1)

Maximum linewidth for the edges when widths are scaled by a column in edges_df

min_width: float (0.1)

Minimum linewidth for the edges when widths are scaled by a column in edges_df

Returns:
line_collection: LineCollection or Line3DCollection
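When width names a column in edges_df, the values have to be mapped onto the interval bounded by min_width and max_width. The reference above does not say which mapping is used; the sketch below assumes a simple linear (min-max) rescaling, and the name `scale_widths` is hypothetical.

```python
def scale_widths(values, min_width=0.1, max_width=1.0):
    """Linearly map values onto the interval [min_width, max_width]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to map, so use the midpoint
        return [(min_width + max_width) / 2] * len(values)
    span = max_width - min_width
    return [min_width + span * (v - lo) / (hi - lo) for v in values]


widths = scale_widths([0.0, 5.0, 10.0])
print(widths)
```

The smallest value is drawn at min_width, the largest at max_width, and everything else falls proportionally in between.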
gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0)

Plots the nodes representing the states of the discrete space on the provided coordinates

Parameters:
axes: matplotlib Axes

Axes in which to plot the nodes or states.

nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)

pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed

x: str (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

y: str (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

z: str (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.

alpha: float (1)

Transparency of markers representing the nodes

zorder: int (2)

Order in which the nodes will be rendered relative to other elements. Generally, we would want this to be bigger than the zorder used for plotting the edges

color: str (‘function’)

Column name for the values according to which states will be colored or the specific color to use for plotting the states

vcenter: bool (None)

Center the color scale around the 0 value

vmax: float

Maximum value to show in the colormap

vmin: float

Minimum value to show in the colormap

cmap: colormap or str

Colormap to use for coloring the nodes according to column color

cbar: bool

Boolean variable representing whether to show the colorbar

cbar_label: str

Label for the colorbar associated to the nodes color scale

cbar_axes: matplotlib Axes

Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes

palette: dict

Dictionary containing the colors associated to the categories specified by the column color in nodes_df, if they express categories rather than numerical values

size: float or str (2.5)

Size of the markers provided for plotting to axes.scatter. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.

max_size: float (40)

Maximum marker size for the nodes when sizes are scaled by a column in nodes_df

min_size: float (1)

Minimum marker size for the nodes when sizes are scaled by a column in nodes_df

lw: float (0)

Width of the line edges delimiting the markers representing the nodes

edgecolor: str (‘black’)

Color of the line edges delimiting the markers representing the nodes

legend: bool (True)

Show legend on the plot

legend_loc: int or tuple

Location of the legend in case of coloring according to a categoric variable

Returns:
line_collection: LineCollection or Line3DCollection
gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None)

Plots the nodes representing the states of the discrete space on the provided coordinates and, if edges_df is provided, the edges representing the connections between connected states

Parameters:
axes: matplotlib Axes

Axes in which to plot the nodes and edges.

nodes_df: pd.DataFrame of shape (n_genotypes, n_variables)

pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed

edges_df: pd.DataFrame of shape (n_edges, 2)

pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.

x: str (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

y: str (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

z: str (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.

nodes_alpha: float (1)

Transparency of markers representing the nodes

nodes_zorder: int (2)

Order in which the nodes will be rendered relative to other elements

nodes_color: str (‘function’)

Column name for the values according to which states will be colored or the specific color to use for plotting the states

nodes_cmap: colormap or str (‘viridis’)

Colormap to use for coloring the nodes according to column nodes_color

nodes_palette: dict (None)

Dictionary containing the colors associated to the categories specified by the column nodes_color in nodes_df, if they express categories rather than numerical values

nodes_vmin: float (None)

Minimum value to show in the colormap

nodes_vmax: float (None)

Maximum value to show in the colormap

nodes_vcenter: bool (False)

Center the color scale around the 0 value

nodes_cbar: bool (True)

Whether to show the colorbar for the nodes color scale

nodes_cbar_axes: matplotlib Axes (None)

Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes

nodes_cmap_label: str (‘Function’)

Label for the colorbar associated to the nodes color scale

nodes_size: float or str (2.5)

Size of the markers representing the nodes. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.

nodes_min_size: float (1)

Minimum marker size for the nodes when sizes are scaled by a column in nodes_df

nodes_max_size: float (40)

Maximum marker size for the nodes when sizes are scaled by a column in nodes_df

nodes_lw: float (0)

Width of the line edges delimiting the markers representing the nodes

nodes_edgecolor: str (‘black’)

Color of the line edges delimiting the markers representing the nodes

edges_alpha: float (0.1)

Transparency of lines representing the edges

edges_zorder: int (1)

Order in which the edges will be rendered relative to other elements

edges_color: str (‘grey’)

Column name for the values according to which edges will be colored or the specific color to use for plotting the edges

edges_cmap: colormap or str (‘binary’)

Colormap to use for coloring the edges according to column edges_color

edges_palete: optional (None)

edges_cbar: bool (False)

Whether to show the colorbar for the edges color scale

edges_cbar_axes: optional (None)

edges_width: float or str (0.5)

Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.

edges_max_width: float (1)

Maximum linewidth for the edges when widths are scaled by a column in edges_df

edges_min_width: float (0.1)

Minimum linewidth for the edges when widths are scaled by a column in edges_df

sort_by: optional (None)

sort_ascending: bool (True)

center_spines: bool (False)

add_hist: bool (False)

inset_cbar: bool (False)

inset_pos: tuple ((0.7, 0.7))

prev_nodes_df: optional (None)

gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)
gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)
gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)

Makes an interactive plot of a fitness landscape with genotypes as nodes and single point mutations as edges using plotly

Parameters:
nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)

pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed

edges_df: pd.DataFrame of shape (n_edges, 2)

pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.

x: str (‘1’)

Column in nodes_df to use for plotting the genotypes on the x-axis

y: str (‘2’)

Column in nodes_df to use for plotting the genotypes on the y-axis

z: str (None)

Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.

nodes_color: str (‘function’)

Column name for the values according to which states will be colored or the specific color to use for plotting the states

nodes_size: float or str (4)

Size of the markers provided for plotting to axes.scatter. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.

nodes_cmap: colormap or str

Colormap to use for coloring the nodes according to column color

nodes_cmap_label: str

Label for colorbar

edges_width: float or str

Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.

edges_color: str

Column name for the values according to which edges will be colored or the specific color to use for plotting the edges

edges_alpha: float (0.2)

Transparency of lines representing the edges

text: array-like of shape (nodes_df.shape[0]) (None)

Labels to show for each state when hovering over the markers representing them. If not provided, rownames of the nodes_df DataFrame will be used

fpath: str

File path in which to store the interactive plot as an html file

gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)
gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)
gpmap.plot.mpl.plot_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)

Generates a 2 panel figure showing how the cross-validated likelihood changes with a hyperparameter and the best selected value for model fitting.

Parameters:
log_Ls: pd.DataFrame of shape (num_a, 3)

DataFrame containing the columns a, logL and fold

seq_density: pd.DataFrame of shape (n_genotypes, >= 2)

DataFrame with columns frequency and Q containing the observed frequencies and estimated densities for each possible sequence, respectively. If not provided, only a 1-panel figure with the cross-validated likelihood curve will be produced

err_bars: str

What to show in the error bars: sd for the standard deviation across the different folds or stderr for the standard error of the mean

show_folds: bool

Whether to show the out of sample log likelihoods for the different folds in the cross-validation procedure separately

Returns:
fig: matplotlib.figure object

Figure object containing the resulting plots
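The two err_bars options differ only in whether the spread across folds is divided by the square root of the number of folds. The sketch below computes both from records shaped like the rows of log_Ls (a, logL, fold). It assumes the sample standard deviation is used and that the best value of a is the one with the highest mean log-likelihood; the name `summarize_folds` and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_folds(records, err_bars="stderr"):
    """records: iterable of (a, logL, fold). Returns {a: (mean, error)}."""
    by_a = defaultdict(list)
    for a, logL, _fold in records:
        by_a[a].append(logL)
    summary = {}
    for a, values in by_a.items():
        sd = stdev(values)  # sample standard deviation across folds
        err = sd if err_bars == "sd" else sd / sqrt(len(values))
        summary[a] = (mean(values), err)
    return summary


# Made-up log-likelihoods for two hyperparameter values and four folds
records = [
    (1.0, -10.0, 0), (1.0, -12.0, 1), (1.0, -14.0, 2), (1.0, -12.0, 3),
    (10.0, -9.0, 0), (10.0, -9.0, 1), (10.0, -11.0, 2), (10.0, -11.0, 3),
]
summary = summarize_folds(records, err_bars="stderr")
best_a = max(summary, key=lambda a: summary[a][0])
print(summary, best_a)
```

With four folds the standard error is half the standard deviation, so stderr bars are narrower than sd bars for the same data.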

Datasets

gpmap.datasets.list_available_datasets()

Returns a list with the names of all available built-in datasets

class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)

DataSet object that allows convenient manipulation of the different objects related to a given dataset. This includes the original data, the reconstructed landscape and the visualization coordinates

Parameters:
dataset_name: str

Name of the dataset to load from the built-in list. If data or landscape are provided, it will be the name given to the new dataset

data: pd.DataFrame of shape (n_obs, n_features)

Dataframe containing the experimental data using genotypes as index

landscape: pd.DataFrame of shape (n_genotypes, 1)

Dataframe containing the complete combinatorial landscape from which to build the remaining objects of the dataset

Attributes:
data
edges
landscape
nodes
relaxation_times

Methods

calc_visualization

plot

save

to_sequence_space