API Reference

Discrete spaces

class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)

Class to define an arbitrary discrete space characterized uniquely by the connectivity between its states and, optionally, by a function (e.g. fitness or energy) evaluated at each state of the discrete space

Parameters:

adjacency_matrix: scipy.sparse.csr_matrix of shape (n_states, n_states): Sparse matrix representing the adjacency relationships between states. The ij’th entry contains a 1 if the states i and j are connected and 0 otherwise
y: array-like of shape (n_states,): Quantitative property associated to each state
state_labels: array-like of shape (n_states,): State labels in the discrete space

Attributes:

n_states: int: Number of states in the discrete space
state_labels: array-like of shape (n_states,): State labels in the discrete space
state_idxs: pd.Series of shape (n_states,): pd.Series containing the index of each state. It is indexed by state_labels and can be used to quickly extract the indexes corresponding to a set of state labels
is_regular: bool: Boolean variable storing whether the resulting graph is regular, that is, whether every node has the same number of neighbors
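
As a rough illustration of the adjacency convention and of what is_regular means, the following sketch checks regularity on two tiny graphs in plain Python. It does not call gpmap: the nested lists stand in for the scipy.sparse.csr_matrix that the class expects, and the helper name is hypothetical.

```python
def is_regular(adjacency):
    """Return True if every state has the same number of neighbors."""
    degrees = [sum(row) for row in adjacency]
    return len(set(degrees)) == 1

# 4 states connected in a cycle: 0-1-2-3-0 (every state has 2 neighbors)
cycle = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
# 3 states in a line: 0-1-2 (the middle state has more neighbors)
line = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(is_regular(cycle), is_regular(line))
```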

Methods

`get_neighbor_pairs`()	Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
`get_neighbors`(states[, max_distance])	Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
`get_state_idxs`(states)	Returns the indexes for the provided state labels

get_edges_df

get_neighbor_pairs(): Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace

get_neighbors(states, max_distance=1)

Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance

Parameters:

states: array-like of shape (state_number,): np.array or list of states from which to select the neighbors
max_distance: int (1): The maximal distance at which neighbors from the provided states will be returned

Returns:

neighbor_states: np.array: Array containing the state labels in the d-neighborhood of `states`
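
The idea of a d-neighborhood can be sketched with a breadth-first search over an adjacency list. This is an illustration only, not gpmap's implementation: the function name and the dictionary-based graph are hypothetical, and this sketch includes the starting states in the result, which the reference above does not specify.

```python
from collections import deque

def neighbors_within(adjacency_list, states, max_distance=1):
    """Return all states reachable from `states` in at most max_distance steps."""
    seen = {s: 0 for s in states}  # state -> distance from the closest start
    queue = deque(states)
    while queue:
        current = queue.popleft()
        if seen[current] == max_distance:
            continue  # do not expand beyond the requested distance
        for nxt in adjacency_list[current]:
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    return sorted(seen)

# Path graph A-B-C-D
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(neighbors_within(graph, ["A"], max_distance=2))
```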

get_state_idxs(states): Returns the indexes for the provided state labels

class gpmap.space.ProductSpace(elementary_graphs, y=None, state_labels=None)

General class for spaces that can be built as cartesian products of smaller subspaces characterized by a set of elementary graphs

Parameters:

elementary_graphs: list of csr_matrix: List of adjacency matrices, one per elementary graph, from which to build the product space
y: None or array-like of shape (n,): np.array containing the phenotypic values associated to each combination of states in the resulting space. If y=None, no phenotypic values will be stored
state_labels: None or list: List with the labels associated to each of the possible states a in each of the l elements of the product space. If state_labels=None, numeric labels will be given by default.

Attributes:

is_regular: Attribute characterizing whether the space is regular, that is, whether every node has the same number of neighbors
n_edges
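
To make the Cartesian product construction concrete, the sketch below combines two small elementary graphs in plain Python. Two product states are connected when they differ in exactly one coordinate and the differing coordinates are connected in the corresponding elementary graph. The function name and the dictionary representation are hypothetical, and the ordering of states is not claimed to match the one used by ProductSpace.

```python
from itertools import product

def cartesian_product_edges(elementary_graphs):
    """elementary_graphs: list of adjacency lists (dict: state -> neighbors)."""
    states = list(product(*[sorted(g) for g in elementary_graphs]))
    edges = set()
    for state in states:
        for axis, graph in enumerate(elementary_graphs):
            # Move along one axis only, keeping the other coordinates fixed
            for neighbor in graph[state[axis]]:
                other = state[:axis] + (neighbor,) + state[axis + 1:]
                edges.add(frozenset((state, other)))
    return states, edges

# The product of two 3-state paths is a 3 x 3 grid
path3 = {0: [1], 1: [0, 2], 2: [1]}
states, edges = cartesian_product_edges([path3, path3])
print(len(states), len(edges))
```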

Methods

`get_neighbor_pairs`()	Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
`get_neighbors`(states[, max_distance])	Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
`get_state_idxs`(states)	Returns the indexes for the provided state labels

calc_adjacency_matrix
calc_states
format_list_ends
format_values
get_edges_df
get_y
init_space
set_dim_sizes
set_y
write_csv
write_edges

class gpmap.space.GridSpace(length, y=None, ndim=2)

Class for creating an N-dimensional grid discrete space

Parameters:

length: int or array-like: Number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If a series of lengths is provided, they will be used to form a grid of dimensions with the specified lengths and the ndim argument will be ignored
ndim: int: Number of dimensions of the grid. It is only used when length is a single integer
y: array-like of shape (length ** ndim,) or None: Phenotypic values associated to each possible state

Methods

set_peaks

class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)

Class for creating a Sequence space characterized by having sequences as states. States are connected in the discrete space if they differ by a single position in the sequence. It can be created in two different ways:

From a set of sequences and function values X, y

By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site and type of alphabet).

Parameters:

X: array-like of shape (n_genotypes,): Sequences to use as state labels of the discrete sequence space
y: array-like of shape (n_genotypes,): Quantitative phenotype or fitness associated to each genotype
seq_length: int (None): Length of the sequences in the sequence space. If not given, it will be guessed from alphabet or n_alleles
n_alleles: list of size `seq_length` (None): List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type=custom
alphabet_type: str (‘dna’): Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}
alphabet: list of `seq_length` lists: Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.
stop_y: float (None): Value of the function given for protein sequence with an in-frame stop codon. If given, it will increase the protein alphabet to incorporate * for stops
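
The connectivity rule described above (states are neighbors when they differ at a single position) can be sketched by enumerating a small complete space. This is a brute-force illustration with hypothetical names, not the way SequenceSpace builds its adjacency matrix.

```python
from itertools import product

def build_sequence_space(alphabet):
    """alphabet: list with one list of alleles per site."""
    genotypes = ["".join(alleles) for alleles in product(*alphabet)]
    # Connect pairs of genotypes at Hamming distance exactly 1
    edges = [
        (i, j)
        for i, a in enumerate(genotypes)
        for j, b in enumerate(genotypes)
        if i < j and sum(x != y for x, y in zip(a, b)) == 1
    ]
    return genotypes, edges

# Two sites with the four DNA alleles each
genotypes, edges = build_sequence_space([["A", "C", "G", "T"]] * 2)
print(len(genotypes), len(edges))
```

With two sites and four alleles per site, every genotype has 2 x 3 = 6 neighbors, so the 16 genotypes share 16 x 6 / 2 = 48 edges.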

Attributes:

n_genotypes: int: Number of states in the complete sequence space
genotypes: array-like of shape (n_genotypes, ): Genotype labels in the sequence space
adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes): Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise
y: array-like of shape (n_genotypes,): Quantitative phenotype or fitness associated to each genotype
is_regular: bool: Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles

Methods

`get_single_mutant_matrix`(sequence[, center])	Returns the effects of single point mutations from a focal sequence
`remove_codon_incompatible_transitions`([...])	Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table
`to_nucleotide_space`([codon_table, alphabet_type])	Transforms a protein space into a nucleotide space using a codon table for translating the sequence

get_single_mutant_matrix(sequence, center=False)

Returns the effects of single point mutations from a focal sequence

Parameters:

sequence: str: String encoding the sequence from which to report all single point mutant effects
center: bool (False): If True, results will be centered by position, so that the mean of allelic effects is 0. If False, the focal sequence will have 0 and values would represent mutational effects from it

Returns:

output: pd.DataFrame of shape (seq_length, total_alleles): pd.DataFrame containing the mutational or allelic effects for each allele across all sequence positions
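
The effect of the center option can be sketched on a toy landscape. The function below is hypothetical and returns a list of dictionaries (one per position) instead of a pd.DataFrame; it only illustrates the two conventions described above.

```python
def single_mutant_effects(y, sequence, alleles, center=False):
    """y: dict mapping each sequence to its function value."""
    rows = []
    for pos in range(len(sequence)):
        row = {}
        for allele in alleles:
            mutant = sequence[:pos] + allele + sequence[pos + 1:]
            # Effect relative to the focal sequence (0 for the focal allele)
            row[allele] = y[mutant] - y[sequence]
        if center:
            # Shift so that allelic effects average to 0 at this position
            mean = sum(row.values()) / len(row)
            row = {allele: value - mean for allele, value in row.items()}
        rows.append(row)
    return rows

y = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 4.0}
effects = single_mutant_effects(y, "AA", alleles="AB")
centered = single_mutant_effects(y, "AA", alleles="AB", center=True)
print(effects)
print(centered)
```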

remove_codon_incompatible_transitions(codon_table='Standard')

Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table

Parameters:

codon_table: str or Bio.Data.CodonTable: NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein

to_nucleotide_space(codon_table='Standard', alphabet_type='dna')

Transforms a protein space into a nucleotide space using a codon table for translating the sequence

Parameters:

codon_table: str or Bio.Data.CodonTable: NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein
alphabet_type: str (‘dna’): Sequence type to use in the resulting nucleotide space. It can only take one of the following values: {‘dna’, ‘rna’}

Returns:

SequenceSpace: Nucleotide sequence space with 4 alleles per site and 3 times the number of sites of the current space

class gpmap.space.HammingBallSpace(X0, X=None, y=None, d=None, n_alleles=None, alphabet_type='dna', alphabet=None)

Class for the space representing the Hamming ball around a target sequence up to a certain number of mutations from it.

Parameters:

X0: str: Focal sequence around which to build the Hamming ball space
X: array-like of shape (n_genotypes,): Sequences to use as state labels of the discrete sequence space
y: array-like of shape (n_genotypes,): Quantitative phenotype or fitness associated to each genotype
d: int (None): Maximum distance from the focal sequence to include in the space
n_alleles: list of size `seq_length` (None): List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type=custom
alphabet_type: str (‘dna’): Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}
alphabet: list of `seq_length` lists: Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.
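
The set of states in a Hamming ball can be sketched by enumerating all sequences with at most d mutations from the focal sequence. The function name is hypothetical and a single shared alphabet is assumed for simplicity; it is not the library's implementation.

```python
from itertools import combinations, product

def hamming_ball(x0, alphabet, d):
    """Return all sequences at Hamming distance <= d from x0."""
    ball = {x0}
    for k in range(1, d + 1):
        # Choose which k positions are mutated
        for positions in combinations(range(len(x0)), k):
            choices = [[a for a in alphabet if a != x0[p]] for p in positions]
            # Choose the new allele at each mutated position
            for alleles in product(*choices):
                seq = list(x0)
                for p, a in zip(positions, alleles):
                    seq[p] = a
                ball.add("".join(seq))
    return sorted(ball)

print(len(hamming_ball("AAA", "ACGT", d=1)), len(hamming_ball("AAA", "ACGT", d=2)))
```

For a focal sequence of length 3 over 4 alleles, there are 1 + 3 x 3 = 10 sequences within one mutation and 10 + 3 x 9 = 37 within two.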

Attributes:

n_genotypes: int: Number of states in the complete sequence space
genotypes: array-like of shape (n_genotypes, ): Genotype labels in the sequence space
adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes): Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise
y: array-like of shape (n_genotypes,): Quantitative phenotype or fitness associated to each genotype
is_regular: bool: Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles

Methods

`get_neighbor_pairs`()	Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
`get_neighbors`(states[, max_distance])	Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
`get_state_idxs`(states)	Returns the indexes for the provided state labels

calc_adjacency_matrix
calc_graph
calc_max_min_path
calc_n_paths
format_list_ends
format_values
get_edges_df
get_genotypes
get_y
init_space
set_alphabet_type
set_seq_length
set_y
write_csv
write_edges

Random walks

class gpmap.randwalk.WMWalk(space, log=None, Ns=None)

Class for Weak Mutation Weak Selection Random Walk on a SequenceSpace. It is a time-reversible continuous-time Markov chain where the transition rates depend on the differences in fitness between two states, scaled by the effective population size Ns.

Attributes:

space: DiscreteSpace class: Space on which the random walk takes place
Ns: float: Scaled effective population size for the evolutionary model
rate_matrix: csr_matrix: Rate matrix defining the continuous time process

Methods

set_Ns():	Method to specify the scaled effective population size Ns, either directly or by specifying the mean function at stationarity or the percentile it represents from the distribution of functions across sequence space
calc_stationary_frequencies():	Calculates the stationary frequencies of the states under the random walk specified on the discrete space
calc_rate_matrix():	Calculates the rate matrix for the continuous time process given the scaled effective population size (Ns) or average phenotype at stationarity.

calc_neutral_mixing_rates(site_exchange_rates, neutral_site_freqs)

Calculates the neutral mixing rates for a SequenceSpace. If no GTR mutation model is specified, the neutral mixing rate is limited by the site with the smallest number of alleles. Otherwise, as mutations are assumed to be site-independent, the neutral mixing rate is limited by the slowest site, given by the smallest of the second eigenvalues of the site rate matrices

Parameters:

neutral_site_Qs: list of array-like of shape (n_alleles, n_alleles): List containing site-specific rate matrices to use for calculating the limiting mixing in the neutral case. If not provided, uniform mutation rates are assumed.
neutral_site_freqs: list of array-like of shape (n_alleles,): List containing vectors with the stationary frequencies under neutrality for each site. They are used to calculate the eigenvalues of the time reversible site specific neutral chain. By default, they are assumed to be uniform across sites and alleles.
site_weights: array-like of shape (seq_length,): Vector containing the relative weight associated to each site. This value is used to scale the individually normalized rate matrices to ensure this specific leaving rate. By default, all weights are equal

Returns:

neutral_mixing_rate: float: Neutral mixing rate as the smallest second largest eigenvalue across sites.
TODO: Re-implement functionality

calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)

Calculates the rate matrix for the random walk in the discrete space and stores it in the attribute rate_matrix

Parameters:

Ns: real: Scaled effective population size for the evolutionary model
neutral_stat_freqs: array-like of shape (n_states,): Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics
neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states): Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
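
This page only states that transition rates depend on fitness differences scaled by Ns and that the chain is time-reversible. The sketch below uses one standard weak-mutation rate, S / (1 - exp(-S)) with S = Ns * (y_j - y_i), under uniform neutral dynamics. That functional form is an assumption made for illustration and is not confirmed by this reference. Under it, stationary frequencies are proportional to exp(Ns * y_i), and detailed balance can be checked numerically.

```python
import math

def wm_rate(delta_f, Ns):
    """Assumed rate of moving to a neighbor whose function value differs by delta_f."""
    s = Ns * delta_f
    if abs(s) < 1e-12:
        return 1.0  # neutral limit of s / (1 - exp(-s))
    return s / (1.0 - math.exp(-s))

def stationary_frequencies(y, Ns):
    """Frequencies proportional to exp(Ns * y) under the assumed model."""
    weights = [math.exp(Ns * value) for value in y]
    total = sum(weights)
    return [w / total for w in weights]

y = [0.0, 1.0, 3.0]
Ns = 2.0
pi = stationary_frequencies(y, Ns)
# Time reversibility: the probability flow between two states is symmetric
flow_01 = pi[0] * wm_rate(y[1] - y[0], Ns)
flow_10 = pi[1] * wm_rate(y[0] - y[1], Ns)
print(flow_01, flow_10)
```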

calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)

Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, re-scaled by the corresponding quantity so that the embedding is in units of square root of time

Parameters:

Ns: float: Scaled effective population size to use in the underlying evolutionary model
mean_function: float: Mean function at stationarity to derive the associated Ns
mean_function_perc: float: Percentile that the mean function at stationarity takes within the distribution of function values along sequence space, e.g. if mean_function_perc=98, then the mean function at stationarity is set to be at the 98th percentile across all the function values
n_components: int (10): Number of eigenvectors or diffusion axes to calculate
neutral_stat_freqs: array-like of shape (n_states,): Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics
neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states): Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
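
How a target mean function at stationarity could be turned into an Ns value can be sketched with a bisection search. The sketch assumes stationary frequencies proportional to exp(Ns * y) (uniform neutral dynamics), which this page does not state explicitly; the function names and the search bounds are hypothetical. The search works because the stationary mean increases monotonically with Ns under that assumption.

```python
import math

def stationary_mean(y, Ns):
    """Mean function value under frequencies proportional to exp(Ns * y)."""
    shift = max(y)  # subtract the maximum for numerical stability
    weights = [math.exp(Ns * (v - shift)) for v in y]
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)

def find_Ns(y, mean_function, upper=1e3, tol=1e-10):
    """Bisection on Ns in [0, upper] to match the target stationary mean."""
    lower = 0.0
    while upper - lower > tol:
        mid = (lower + upper) / 2
        if stationary_mean(y, mid) < mean_function:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2

y = [0.0, 1.0, 2.0, 3.0]
Ns = find_Ns(y, mean_function=2.5)
print(Ns, stationary_mean(y, Ns))
```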

write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')

Writes the output of the visualization to tables with a common prefix. The output consists of 2 or 3 tables, since the edges table does not change between visualizations and does not need to be stored every time:

nodes coordinates: contains the coordinates for each state and the associated function values and stationary frequencies. It is stored in CSV format with suffix “nodes.csv” or in parquet format with suffix “nodes.pq”
decay rates: contains the decay rates and relaxation times associated to each component or diffusion axis. It is stored in CSV format with suffix “decay_rates.csv”
edges: contains the adjacency relationships between states. It is not stored unless write_edges=True, as it remains unchanged for any visualization on the same SequenceSpace and therefore only needs to be stored once. It can be stored in CSV format or in the more efficient npz format for sparse matrices

Parameters:

prefix: str: Prefix of the files to store the different tables
write_edges: bool (False): Option to also write the information about the adjacency relationships between pairs of states for plotting the edges
nodes_format: str {‘parquet’, ‘csv’}: Format to store the nodes information. parquet is more efficient but CSV can be used in smaller cases for plain text storage.
edges_format: str {‘npz’, ‘csv’}: Format to store the edges information. npz is more efficient but CSV can be used in smaller cases for plain text storage.

Landscape Inference

class gpmap.inference.MinimumEpistasisInterpolator(P=2, n_alleles=None, seq_length=None, alphabet_type='custom', cg_rtol=1e-16)

Methods

predict()

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

smooth

predict()

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

Returns:

function: pd.DataFrame of shape (n_genotypes, 1): Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype

class gpmap.inference.MinimumEpistasisRegression(P, a=None, n_alleles=None, seq_length=None, alphabet_type='custom', nfolds=5, num_reg=20, min_log_reg=-2, max_log_reg=6, progress=True, cg_rtol=0.0001)

Methods

`fit`(X, y[, y_var, cross_validation])	Infers the optimal a from the provided data, that is, the magnitude of the Pth order local epistatic coefficients that maximizes predictive performance in held-out data
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

fit(X, y, y_var=None, cross_validation=False)

Infers the optimal a from the provided data, that is, the magnitude of the Pth order local epistatic coefficients that maximizes predictive performance in held-out data

Parameters:

X: array-like of shape (n_obs,): Vector containing the genotypes for which observations are provided in y
y: array-like of shape (n_obs,): Vector containing the observed phenotypes corresponding to X sequences
y_var: array-like of shape (n_obs,): Vector containing the empirical or experimentally known variance of the measurements in y

Returns:

a: float: Optimal a value maximizing the cross-validated log-likelihood

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:

contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts): DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:

contrasts: pd.DataFrame of shape (n_contrasts, 5): DataFrame summarizing the posterior distribution of each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval, and the posterior probability of the quantity being larger or smaller than 0.
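
What a contrast summary involves can be sketched for a Gaussian posterior: a linear combination with weights c has mean c·m and variance c·Σ·c, from which the interval and tail probability follow. The function below is a hypothetical stand-alone illustration; the posterior mean and covariance are made-up inputs and the dictionary keys are not claimed to match the column names returned by make_contrasts.

```python
from statistics import NormalDist

def contrast_summary(weights, post_mean, post_cov):
    """Posterior summary of sum_i weights[i] * f_i for a Gaussian posterior."""
    n = len(weights)
    mean = sum(w * m for w, m in zip(weights, post_mean))
    var = sum(weights[i] * post_cov[i][j] * weights[j]
              for i in range(n) for j in range(n))
    dist = NormalDist(mean, var ** 0.5)
    return {
        "estimate": mean,
        "std": var ** 0.5,
        "ci_95_lower": dist.inv_cdf(0.025),
        "ci_95_upper": dist.inv_cdf(0.975),
        "p_greater_than_0": 1 - dist.cdf(0),
    }

# Difference between two genotypes with correlated posterior uncertainty
summary = contrast_summary(
    weights=[1, -1],
    post_mean=[2.0, 1.0],
    post_cov=[[1.0, 0.5], [0.5, 1.0]],
)
print(summary)
```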

class gpmap.inference.VCregression(lambdas=None, n_alleles=None, seq_length=None, alphabet_type='custom', beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=1e-16, progress=True)

Variance Component regression model that allows inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior parametrized by the contribution of the different orders of interaction to the observed genetic variability of a continuous phenotype

It requires the same number of alleles at every site

Methods

`fit`(X, y[, y_var])	Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
`predict`([X_pred, calc_variance])	Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes
`simulate`([sigma, p_missing])	Simulates data under the specified Variance component priors

lambdas_to_variance

fit(X, y, y_var=None)

Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships

Stores learned lambdas in the attribute VCregression.lambdas to use internally for predictions and returns them as output

Parameters:

X: array-like of shape (n_obs,): Vector containing the genotypes for which observations are provided in y
y: array-like of shape (n_obs,): Vector containing the observed phenotypes corresponding to X sequences
y_var: array-like of shape (n_obs,): Vector containing the empirical or experimentally known variance of the measurements in y

Returns:

lambdas: array-like of shape (seq_length + 1,): Variances for each order of interaction k inferred from the data

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:

contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts): DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:

contrasts: pd.DataFrame of shape (n_contrasts, 5): DataFrame summarizing the posterior distribution of each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval, and the posterior probability of the quantity being larger or smaller than 0.

predict(X_pred=None, calc_variance=False)

Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes

Parameters:

X_pred: array-like of shape (n_genotypes,): Vector containing the genotypes for which we want to predict the phenotype. If X_pred=None, predictions are provided for the whole sequence space
calc_variance: bool (False): Option to also return the posterior variances for each individual genotype

Returns:

function: pd.DataFrame of shape (n_genotypes, 1): Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype

simulate(sigma=0, p_missing=0)

Simulates data under the specified Variance component priors

Parameters:

sigma: real: Standard deviation of the experimental noise additional to the variance components
p_missing: float between 0 and 1: Probability of randomly missing genotypes in the simulated output data

Returns:

data: pd.DataFrame of shape (n_genotypes, 3): DataFrame with the columns y_true, y and var, corresponding to the true function at each genotype, the observed values and the variance of the measurement, respectively, for each sequence or genotype indicated in the DataFrame.index
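
The roles of sigma and p_missing can be sketched on an already simulated true function. The helper below is hypothetical and uses a plain dictionary in place of a pd.DataFrame; it also assumes that missing genotypes are simply dropped, which this page does not specify.

```python
import random

def add_noise_and_missingness(y_true, sigma=0.0, p_missing=0.0, seed=0):
    """y_true: dict mapping each genotype to its true function value."""
    rng = random.Random(seed)
    rows = {}
    for genotype, value in y_true.items():
        if rng.random() < p_missing:
            continue  # this genotype is missing from the simulated data
        rows[genotype] = {
            "y_true": value,
            "y": value + rng.gauss(0.0, sigma),  # observed value with noise
            "var": sigma ** 2,                   # known measurement variance
        }
    return rows

y_true = {"AA": 0.0, "AB": 1.0, "BA": 2.0, "BB": 4.0}
noiseless = add_noise_and_missingness(y_true)
noisy = add_noise_and_missingness(y_true, sigma=0.5, p_missing=0.25, seed=1)
print(noisy)
```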

class gpmap.inference.SeqDEFT(P, n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)

Sequence Density Estimation using Field Theory model that allows inference of a complete sequence probability distribution under a Gaussian Process prior parameterized by variance of local epistatic coefficients of order P

It requires the same number of alleles at every site

Parameters:

P: int: Order of the local interaction coefficients that are penalized under the prior, i.e. P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph, while P=3 penalizes local 3-way interactions across all possible cubes.
a: float (None): Parameter related to the inverse of the variance of the P-order epistatic coefficients that are being penalized. Larger values induce stronger penalization and approximation to the Maximum-Entropy model of order P-1. If a=None, the best a is found through cross-validation
num_reg: int (20): Number of a values to evaluate through cross-validation
nfolds: int (5): Number of folds to use in the cross-validation procedure

Methods

`fit`(X[, y, baseline_phi, baseline_X, ...])	Infers the sequence-function relationship under the specified Delta^{(P)} prior
`make_contrasts`(contrast_matrix)	Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
`simulate`(N[, phi, seed])	Simulates data under the specified a penalization for local P-epistatic coefficients
`simulate_phi`()	Simulates data under the specified a penalization for local P-epistatic coefficients

fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)

Infers the sequence-function relationship under the specified Delta^{(P)} prior

Parameters:

X: array-like of shape (n_obs,): Vector containing the observed sequences
y: array-like of shape (n_obs,): Vector containing the weights for each observed sequence. By default, each sequence takes a weight of 1. These weights can be calculated using phylogenetic correction
baseline_X: array-like of shape (n_genotypes,): Vector containing the sequences associated with baseline_phi
baseline_phi: array-like of shape (n_genotypes,): Vector containing the baseline_phi to include in the model
positions: array-like of shape (n_pos,): If provided, subsequences at these positions in the provided input sequences will be used as input
phylo_correction: bool (False): Apply phylogenetic correction using the full length sequences
adjust_freqs: bool (False): Whether to correct densities by the expected allele frequencies in the full length sequences
allele_freqs: dict or codon_table: Dictionary containing the expected frequencies for every allele in the set of possible sequences, or the codon table to use to generate expected amino acid frequencies. If None, they will be calculated from the full length observed sequences.

make_contrasts(contrast_matrix)

Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.

Parameters:

contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts): DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution

Returns:

contrasts: pd.DataFrame of shape (n_contrasts, 5): DataFrame summarizing the posterior distribution of each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval, and the posterior probability of the quantity being larger or smaller than 0.

simulate(N, phi=None, seed=None)

Simulates data under the specified a penalization for local P-epistatic coefficients

Parameters:

N: int: Number of total sequences to sample
phi: array-like of shape (n_genotypes,): Vector containing values for the field underlying the probability distribution from which to sample sequences. If provided, they will be used instead of sampling them from the prior characterized by the given a.
seed: int (None): Random seed to use for simulation

Returns:

X: array-like of shape (N,): Vector containing the sampled sequences from the probability distribution
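
Sampling sequences from a field phi can be sketched as follows. The sketch assumes that the probability of each genotype is proportional to exp(-phi), which is a common field-theory convention but is not stated on this page; the function names are hypothetical.

```python
import math
import random

def phi_to_probabilities(phi):
    """Assumed mapping from field values to probabilities: Q proportional to exp(-phi)."""
    shift = min(phi)  # shift for numerical stability; cancels after normalization
    weights = [math.exp(-(value - shift)) for value in phi]
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequences(genotypes, phi, N, seed=None):
    """Draw N sequences with replacement from the distribution defined by phi."""
    rng = random.Random(seed)
    return rng.choices(genotypes, weights=phi_to_probabilities(phi), k=N)

genotypes = ["AA", "AB", "BB"]
phi = [0.0, 1.0, 2.0]
probs = phi_to_probabilities(phi)
sample = sample_sequences(genotypes, phi, N=5, seed=0)
print(probs, sample)
```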

simulate_phi()

Simulates data under the specified a penalization for local P-epistatic coefficients

Returns:

phi: array-like of shape (n_genotypes,): Vector containing values for the latent phenotype or field sampled from the prior characterized by a

Sequence utils

gpmap.seq.guess_space_configuration(seqs, ensure_full_space=True, force_regular=False, force_regular_alleles=False)

Guesses the sequence space configuration from a collection of sequences. This allows a different number of alleles per site and maintains the order in which alleles appear in the sequences when enumerating the alleles per position

Parameters:

seqs: array-like of shape (n_genotypes,): Vector or list containing the sequences from which we want to infer the space configuration
ensure_full_space: bool: Option to ensure that the whole sequence space must be represented by the set of provided sequences. This is a useful feature to identify whether there are missing genotypes before defining the space and random walk to visualize the full landscape.
force_regular: bool: Option to ensure that there are the same number of alleles per site. New allele names will be added to sites with less than the maximum number of alleles across sites
force_regular_alleles: bool: Option to additionally ensure that the same alleles are common across all sites

Returns:

config: dict with keys {‘length’, ‘n_alleles’, ‘alphabet’}: Returns a dictionary with the inferred configuration of the discrete space where the sequences come from.
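
The core of this inference can be sketched in a few lines: collect, for every position, the alleles in order of first appearance. The function name is hypothetical and the sketch ignores the ensure_full_space, force_regular and force_regular_alleles options; only the dictionary keys are taken from the description above.

```python
def guess_configuration(seqs):
    """Infer length, alleles per site and per-site alphabet from sequences."""
    length = len(seqs[0])
    alphabet = [[] for _ in range(length)]
    for seq in seqs:
        for site, allele in enumerate(seq):
            # Keep alleles in the order in which they first appear
            if allele not in alphabet[site]:
                alphabet[site].append(allele)
    return {
        "length": length,
        "n_alleles": [len(alleles) for alleles in alphabet],
        "alphabet": alphabet,
    }

config = guess_configuration(["AC", "GC", "AT"])
print(config)
```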

gpmap.seq.get_custom_codon_table(aa_mapping)

Builds a biopython CodonTable to use for translation with a custom genetic code

Parameters:

aa_mapping: pd.DataFrame: pandas DataFrame with columns “Codon” and “Letter” representing the genetic code correspondence. Stop codons should appear as “*”

Returns:

codon_table: Bio.Data.CodonTable.CodonTable object: Standard biopython codon table object to use for translating sequences

gpmap.seq.get_one_hot_from_alleles(alphabet)

Returns a one hot encoding CSR matrix for a complete combinatorial space. It uses a fast recursive method to avoid repeatedly building common blocks of the full matrix

Parameters:

alphabet: list of list: List containing lists of alleles per site in a sequence space

Returns:

scipy.sparse.csr_matrix of shape (n_genotypes, total_n_alleles): CSR matrix containing the one hot encoding of the full sequence space, with genotypes sorted lexicographically
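
The layout of the resulting matrix can be sketched with a brute-force dense version: one row per genotype and one column per (site, allele) pair. This is not the fast recursive method mentioned above and it returns nested lists rather than a CSR matrix; the function name is hypothetical. Rows come out in lexicographic order as long as the alleles of each site are given in sorted order.

```python
from itertools import product

def one_hot_full_space(alphabet):
    """alphabet: list with one list of alleles per site."""
    # Column offset at which each site's block of alleles starts
    offsets, total = [], 0
    for alleles in alphabet:
        offsets.append(total)
        total += len(alleles)
    rows = []
    for genotype in product(*alphabet):
        row = [0] * total
        for site, allele in enumerate(genotype):
            row[offsets[site] + alphabet[site].index(allele)] = 1
        rows.append(row)
    return rows

# Two alleles at the first site and three at the second: 6 genotypes, 5 columns
encoding = one_hot_full_space([["A", "B"], ["A", "B", "C"]])
print(encoding)
```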

gpmap.seq.get_alphabet(n_alleles=None, alphabet_type=None)

Returns the resulting alphabet from specifying either the type or the number of alleles per site

Parameters:

n_alleles: int: Number of alleles per site
alphabet_type: str: Type of alphabet to use out of {None, ‘dna’, ‘rna’, ‘protein’}

Returns:

alphabet: list: List containing the alleles in the desired alphabet

gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')

Returns a list of dictionaries with the mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are pooled together into a single allele

Parameters:

seqs: array-like of shape (n_genotypes,) or (n_obs,): Observed sequences. If counts=None, then every sequence is counted once. Otherwise, frequencies are calculated using the counts as the number of times a certain sequence appears in the data
n_alleles: int or array-like of shape (seq_length,): Maximal number of alleles allowed per site. If a list or array is provided, each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles
counts: None or array-like of shape (n_genotypes,): Number of times every sequence in seqs appears in the data. If not provided, every provided sequence is assumed to appear exactly once
keep_allele_names: bool: If keep_allele_names=True, then allele names are preserved. Otherwise, they are replaced by new alleles taken from the alphabet
last_character: str: Character to use for remaining alleles when keep_allele_names=True

Returns:

code: list of dict of length seq_length: List of dictionaries containing the new allele corresponding to each of the original alleles for each site.
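The pooling of rare alleles can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: the helper freq_reduced_code is hypothetical, it only handles a scalar n_alleles with allele names preserved, and it assumes that one of the n_alleles slots is reserved for the pooled allele when a site has too many alleles.

```python
from collections import Counter


def freq_reduced_code(seqs, n_alleles, counts=None, last_character="X"):
    """Per-site mapping from observed alleles to a reduced alphabet."""
    if counts is None:
        counts = [1] * len(seqs)
    code = []
    for site in range(len(seqs[0])):
        freqs = Counter()
        for seq, c in zip(seqs, counts):
            freqs[seq[site]] += c
        # Most frequent first; ties broken alphabetically for reproducibility
        ranked = sorted(freqs, key=lambda a: (-freqs[a], a))
        if len(ranked) <= n_alleles:
            kept, pooled = ranked, []
        else:
            # Assumption: one slot is reserved for the pooled allele
            kept, pooled = ranked[: n_alleles - 1], ranked[n_alleles - 1:]
        site_code = {a: a for a in kept}
        site_code.update({a: last_character for a in pooled})
        code.append(site_code)
    return code


seqs = ["AC", "AG", "TC", "GC", "AC"]
code = freq_reduced_code(seqs, n_alleles=2)
```

In this toy input the first site has three alleles, so the two rarest (G and T) are mapped to the pooled character, while the second site already fits within the limit and is left unchanged.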

gpmap.seq.transcribe_seqs(seqs, code)

gpmap.seq.translate_seqs(seqs, codon_table='Standard')

gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)

Obtains a series of sequences and their counts from a Multiple Sequence Alignment (MSA) provided as a list of sequences. It can select subsequences by selecting which positions to look at in the MSA and do sequence identity re-weighting by considering the sequence similarities across the full length sequence

Parameters:

X: array-like of aligned sequences: Input sequences from which to extract counts
y: array-like of weights (None): Pre-calculated weights associated to the input sequences
positions: array-like of positions (None): If provided, subsequences at this subset of positions will be used to provide counts or re-weighted counts
phylo_correction: bool (False): If True, observations will be re-weighted using sequence similarity along the whole sequence as 1 over the number of similar sequences in the MSA. Similar sequences are defined as those that differ less from each other than the specified `max_dist`
max_dist: float (0.2): Pairs of sequences that differ by less than this value will be considered similar for re-weighting

Returns:

X: np.array of shape (n_unique_seqs, ): Unique subsequences at the specified positions in the MSA
y: np.array of shape (n_unique_seqs, ): Counts or re-weighted counts for each of the unique subsequences in the MSA
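The re-weighting described above can be sketched in plain Python. The helper names are hypothetical, and two details are assumptions not stated in this reference: the distance is taken to be the fraction of differing positions over the full alignment, and each sequence counts itself among its similar sequences.

```python
from collections import defaultdict


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def msa_counts(X, positions=None, phylo_correction=False, max_dist=0.2):
    """Counts (or re-weighted counts) of subsequences in an alignment."""
    if phylo_correction:
        # Weight = 1 / number of sequences closer than max_dist,
        # computed on the full-length sequences (itself included)
        weights = [
            1 / sum(hamming_fraction(s, t) < max_dist for t in X) for s in X
        ]
    else:
        weights = [1.0] * len(X)

    totals = defaultdict(float)
    for seq, w in zip(X, weights):
        sub = seq if positions is None else "".join(seq[p] for p in positions)
        totals[sub] += w
    return dict(totals)


msa = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
raw = msa_counts(msa, positions=[0, 1])
reweighted = msa_counts(msa, positions=[0, 1], phylo_correction=True)
```

The two nearly identical sequences share their weight, so the subsequence they contribute to counts once instead of twice after re-weighting.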

Genotypes handling

gpmap.utils.read_dataframe(fpath)

gpmap.utils.read_edges(fpath, log=None, return_df=True)

Reads the incidence matrix containing the adjacency information among genotypes from a sequence space

Parameters:

fpath: str: File path containing the edges of a sequence space. The extension will be used to differentiate between csv and the more efficient npz format
return_df: bool (True): Whether to return a pd.DataFrame with the edges. Alternatively it will return a csr_matrix

Returns:

edges_df: pd.DataFrame of shape (n_edges, 2) or csr_matrix: DataFrame with column names i and j containing the indexes of the genotypes that are separated by a single mutation in a sequence space

gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)

Selects the provided genotypes from nodes_df with the corresponding edges among the remaining genotypes if edges are provided

Parameters:

nodes_df: pd.DataFrame of shape (n_genotypes, n_features): DataFrame with the genotypes from a full sequence space as index. Typically, it will contain, at least, the coordinates of the visualization for each genotype, but it will keep any other column in the DataFrame for later use
genotypes: array-like of shape (n_genotypes,): Array of ordered genotypes to select from the starting landscape. It should contain the genotype labels by default, or indexes if is_idx=True
edges: pd.DataFrame of shape (n_edges, 2) or scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes): DataFrame or csr_matrix containing the adjacency relationships among genotypes provided in nodes_df in the discrete space
is_idx: bool: The genotypes argument is an array of indexes instead of an array of genotype labels to select genotypes

Returns:

output: (nodes_df, edges): Filtered landscape containing the selected genotypes and the adjacency relationships between them given as a tuple
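As a rough illustration of what selecting genotypes together with their edges involves, the sketch below uses plain lists in place of the DataFrame. The helper select_nodes_and_edges is hypothetical, and it assumes that the returned edges are re-indexed relative to the filtered set of genotypes.

```python
def select_nodes_and_edges(labels, edges, selected):
    """Keep `selected` labels and the edges among them, with new indexes.

    `labels` lists genotype labels in index order; `edges` is a list of
    (i, j) index pairs referring to positions in `labels`.
    """
    old_idx = {label: i for i, label in enumerate(labels)}
    # New index of each retained genotype, following the order of `selected`
    new_idx = {old_idx[label]: k for k, label in enumerate(selected)}
    kept_edges = [
        (new_idx[i], new_idx[j])
        for i, j in edges
        if i in new_idx and j in new_idx
    ]
    return list(selected), kept_edges


labels = ["AA", "AB", "BA", "BB"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
nodes, sub_edges = select_nodes_and_edges(labels, edges, ["AA", "AB", "BB"])
```

Edges touching a discarded genotype (here BA) are dropped, and the remaining pairs refer to positions in the filtered list.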

gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})

Returns the genotype labels matching the specified conditions as maximum and minimum values of the dataframe

Parameters:

nodes_df: pd.DataFrame of shape (n_genotypes, n_features): DataFrame with the genotypes from a full sequence space as index. Typically, it will contain, at least, the coordinates of the visualization for each genotype, but it will keep any other column in the DataFrame for later use
max_values: dict: Dictionary with column names as keys and max values to filter genotypes as values
min_values: dict: Dictionary with column names as keys and min values to filter genotypes as values

Returns:

genotypes: array-like of shape (n_selected,): Array containing the selected genotypes from the input dataframe
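The filtering logic can be sketched without pandas, using a dictionary of rows in place of nodes_df. The helper genotypes_from_region is hypothetical, and treating the bounds as inclusive is an assumption, since this reference does not say whether the comparison is strict.

```python
def genotypes_from_region(nodes, max_values=None, min_values=None):
    """Return labels whose columns fall within the given bounds.

    `nodes` maps each genotype label to a dict of column values.
    """
    max_values = max_values or {}
    min_values = min_values or {}
    selected = []
    for label, row in nodes.items():
        # Assumption: bounds are inclusive
        below = all(row[col] <= v for col, v in max_values.items())
        above = all(row[col] >= v for col, v in min_values.items())
        if below and above:
            selected.append(label)
    return selected


nodes = {
    "AA": {"1": -1.0, "2": 0.5, "function": 2.0},
    "AB": {"1": 0.2, "2": 0.1, "function": 1.0},
    "BB": {"1": 1.5, "2": -0.3, "function": 3.0},
}
region = genotypes_from_region(
    nodes, max_values={"1": 1.0}, min_values={"function": 1.5}
)
```

Conditions on different columns are combined with a logical AND, so a genotype must satisfy every bound to be returned.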

gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)

Averages out some positions in the sequences for all numeric values provided in the input dataframe

Parameters:

nodes_df: pd.DataFrame: DataFrame with sequence names as index and at least one numeric column to calculate the average across the selected backgrounds
keep_pos: array-like (None): If provided, list of 0-indexed positions that are to be preserved and averaged across all genetic backgrounds specified by the remaining positions
skip_pos: array-like (None): If provided, list of 0-indexed positions to average out
return_edges: bool (False): Also return an edges_df DataFrame to use directly for visualization

Returns:

nodes_df: pd.DataFrame: DataFrame containing the average value of every numeric column in the input DataFrame with the subsequences at the desired positions as index
edges_df: pd.DataFrame: DataFrame containing the edges of the reduced sequence space. It will only be provided if `return_edges=True`
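Averaging out positions amounts to grouping sequences by the subsequence at the retained positions and taking the mean within each group. The sketch below shows this for a single numeric value per sequence; the helper marginalize is hypothetical and uses a plain dictionary in place of the DataFrame.

```python
from collections import defaultdict


def marginalize(values, keep_pos):
    """Average `values` (sequence -> number) over positions not in keep_pos."""
    groups = defaultdict(list)
    for seq, value in values.items():
        # Subsequence at the 0-indexed positions to preserve
        sub = "".join(seq[p] for p in keep_pos)
        groups[sub].append(value)
    return {sub: sum(v) / len(v) for sub, v in groups.items()}


landscape = {"AA": 1.0, "AB": 3.0, "BA": 2.0, "BB": 6.0}
averaged = marginalize(landscape, keep_pos=[0])
```

Here the second position is averaged out, leaving one value per allele at the first position.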

Plotting

gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})

Plots the relaxation times associated to each of the components calculated using `WMWalk.calc_visualization`

Parameters:

decay_df: pd.DataFrame of shape (n_components, 3): pd.DataFrame containing the decay rates and the associated mean relaxation times for each of the calculated components
axes: matplotlib Axes object (None): Axes where to plot. If not provided, a new figure will be created automatically for this plot and saved in the path provided by fpath
fpath: str (None): File path to store the plot. If fpath=None, axes argument must be provided for plotting.
log_scale: bool (False): Plot the relaxation times in log scale
neutral_time: float (None): If provided, an additional horizontal line will be plotted representing the relaxation time associated to the neutral process. This is useful when selecting the number of relevant dimensions to plot
kwargs: dict: Dictionary of additional keyword arguments passed to axes.plot and axes.scatter, e.g. color.

gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None)

Plots the edges representing the connections between states that are connected in the discrete space under a particular embedding

Parameters:

axes: matplotlib Axes: Axes in which to plot the edges.
nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2): pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
edges_df: pd.DataFrame of shape (n_edges, 2): pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
x: str (‘1’): Column in nodes_df to use for plotting the genotypes on the x-axis
y: str (‘2’): Column in nodes_df to use for plotting the genotypes on the y-axis
z: str (None): Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.
alpha: float (0.1): Transparency of lines representing the edges
zorder: int (1): Order in which the edges will be rendered relative to other elements. Generally, we would want this to be smaller than the zorder used for plotting the nodes
color: str (‘grey’): Column name for the values according to which edges will be colored or the specific color to use for plotting the edges
cmap: colormap or str: Colormap to use for coloring the edges according to column color
width: float or str: Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.
max_width: float (1): Maximum linewidth for the edges when widths are scaled by a column of edges_df
min_width: float (0.1): Minimum linewidth for the edges when widths are scaled by a column of edges_df

Returns:

line_collection: LineCollection or Line3DCollection
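A matplotlib LineCollection is built from a list of line segments, so plotting edges comes down to looking up the coordinates of both endpoints of every (i, j) pair. The sketch below shows that lookup in plain Python; the helper edge_segments is hypothetical and is not taken from the library.

```python
def edge_segments(coords, edges):
    """Turn (i, j) index pairs into ((x_i, y_i), (x_j, y_j)) segments."""
    return [(coords[i], coords[j]) for i, j in edges]


# One (x, y) pair per state, e.g. taken from columns "1" and "2" of nodes_df
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
# Pairs of connected states, e.g. columns "i" and "j" of edges_df
edges = [(0, 1), (1, 2)]
segments = edge_segments(coords, edges)
```

A list of this form is what matplotlib.collections.LineCollection accepts as its segments argument; with three coordinates per state the same idea applies to the 3D case.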

gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0)

Plots the nodes representing the states of the discrete space on the provided coordinates

Parameters:

axes: matplotlib Axes: Axes in which to plot the nodes or states.
nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2): pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
x: str (‘1’): Column in nodes_df to use for plotting the genotypes on the x-axis
y: str (‘2’): Column in nodes_df to use for plotting the genotypes on the y-axis
z: str (None): Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.
alpha: float (1): Transparency of markers representing the nodes
zorder: int (2): Order in which the nodes will be rendered relative to other elements. Generally, we would want this to be bigger than the zorder used for plotting the edges
color: str (‘function’): Column name for the values according to which states will be colored or the specific color to use for plotting the states
vcenter: bool (False): Center the color scale around the 0 value
vmax: float: Maximum value to show in the colormap
vmin: float: Minimum value to show in the colormap
cmap: colormap or str: Colormap to use for coloring the nodes according to column color
cbar: bool: Boolean variable representing whether to show the colorbar
cbar_label: str: Label for the colorbar associated to the nodes color scale
cbar_axes: matplotlib Axes: Axes to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes
palette: dict: Dictionary containing the colors associated to the categories specified by the column color in nodes_df, if they express categories rather than numerical values
size: float or str (2.5): Size of the markers provided for plotting to axes.scatter. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.
max_size: float (40): Maximum marker size for the nodes when sizes are scaled by a column of nodes_df
min_size: float (1): Minimum marker size for the nodes when sizes are scaled by a column of nodes_df
lw: float (0): Width of the line edges delimiting the markers representing the nodes
edgecolor: str (‘black’): Color of the line edges delimiting the markers representing the nodes
legend: bool (True): Show legend on the plot
legend_loc: int or tuple: Location of the legend in case of coloring according to a categorical variable

Returns:

line_collection: LineCollection or Line3DCollection

gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None)

Plots the nodes representing the states of the discrete space on the provided coordinates and, if provided, the edges representing the connections between connected states

Parameters:

axes: matplotlib Axes: Axes in which to plot the nodes and edges.
nodes_df: pd.DataFrame of shape (n_genotypes, n_variables): pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
edges_df: pd.DataFrame of shape (n_edges, 2): pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
x: str, optional: by default ‘1’
y: str, optional: by default ‘2’
z: optional: by default None
nodes_alpha: int, optional: by default 1
nodes_zorder: int, optional: by default 2
nodes_color: str, optional: by default ‘function’
nodes_cmap: str, optional: by default ‘viridis’
nodes_palette: optional: by default None
nodes_vmin: optional: by default None
nodes_vmax: optional: by default None
nodes_vcenter: bool, optional: by default False
nodes_cbar: bool, optional: by default True
nodes_cbar_axes: optional: by default None
nodes_cmap_label: str, optional: by default ‘Function’
nodes_size: float, optional: by default 2.5
nodes_min_size: int, optional: by default 1
nodes_max_size: int, optional: by default 40
nodes_lw: int, optional: by default 0
nodes_edgecolor: str, optional: by default ‘black’
edges_alpha: float, optional: by default 0.1
edges_zorder: int, optional: by default 1
edges_color: str, optional: by default ‘grey’
edges_cmap: str, optional: by default ‘binary’
edges_palete: optional: by default None
edges_cbar: bool, optional: by default False
edges_cbar_axes: optional: by default None
edges_width: float, optional: by default 0.5
edges_max_width: int, optional: by default 1
edges_min_width: float, optional: by default 0.1
sort_by: optional: by default None
sort_ascending: bool, optional: by default True
center_spines: bool, optional: by default False
add_hist: bool, optional: by default False
inset_cbar: bool, optional: by default False
inset_pos: tuple, optional: by default (0.7, 0.7)
prev_nodes_df: optional: by default None

gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)

gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)

gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)

Makes an interactive plot of fitness landscape with genotypes as nodes and single point mutations as edges using plotly

Parameters:

nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2): pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
edges_df: pd.DataFrame of shape (n_edges, 2): pd.DataFrame containing the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
x: str (‘1’): Column in nodes_df to use for plotting the genotypes on the x-axis
y: str (‘2’): Column in nodes_df to use for plotting the genotypes on the y-axis
z: str (None): Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced.
nodes_color: str (‘function’): Column name for the values according to which states will be colored or the specific color to use for plotting the states
nodes_size: float or str (4): Size of the markers representing the nodes. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.
nodes_cmap: colormap or str: Colormap to use for coloring the nodes according to column color
nodes_cmap_label: str: Label for the colorbar
edges_width: float or str: Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.
edges_color: str: Column name for the values according to which edges will be colored or the specific color to use for plotting the edges
edges_alpha: float (0.2): Transparency of lines representing the edges
text: array-like of shape (nodes_df.shape[0]) (None): Labels to show for each state when hovering over the markers representing them. If not provided, the row names of the nodes_df DataFrame will be used
fpath: str: File path in which to store the interactive plot as an html file

gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)

gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)

gpmap.plot.mpl.plot_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)

Generates a 2 panel figure showing how the cross-validated likelihood changes with a hyperparameter and the best selected value for model fitting.

Parameters:

log_Ls: pd.DataFrame of shape (num_a, 3): DataFrame containing the column names a, logL and fold
seq_density: pd.DataFrame of shape (n_genotypes, >= 2): DataFrame with column names frequency and Q, containing the observed frequencies and estimated densities for each possible sequence, respectively. If not provided, only a 1 panel figure with the cross-validated likelihood curve will be produced
err_bars: str: What to show in the error bars: ‘sd’ for the standard deviation across the different folds or ‘stderr’ for the standard error of the mean
show_folds: bool: Whether to show the out of sample log likelihoods for the different folds in the cross-validation procedure separately

Returns:

fig: matplotlib.figure object: Figure object containing the resulting plots
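The two error bar options differ only by a factor of the square root of the number of folds. The sketch below summarizes rows of (a, logL, fold) in plain Python; the helper summarize_folds is hypothetical, and picking the value of a with the highest mean log likelihood is an assumption about how the best value is chosen.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_folds(records, err_bars="stderr"):
    """Mean logL and error bar per value of `a` from (a, logL, fold) rows."""
    by_a = defaultdict(list)
    for a, logL, _fold in records:
        by_a[a].append(logL)
    summary = {}
    for a, values in by_a.items():
        sd = stdev(values)  # sample standard deviation across folds
        err = sd if err_bars == "sd" else sd / sqrt(len(values))
        summary[a] = (mean(values), err)
    return summary


records = [(0.1, -10.0, 0), (0.1, -12.0, 1), (1.0, -8.0, 0), (1.0, -9.0, 1)]
summary = summarize_folds(records)
# Assumption: the selected hyperparameter maximizes the mean held-out logL
best_a = max(summary, key=lambda a: summary[a][0])
```

With err_bars set to ‘sd’ the same function returns the standard deviation across folds instead of the standard error of the mean.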

Datasets

gpmap.datasets.list_available_datasets(): Returns a list with the names of all available built-in datasets

class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)

DataSet object that allows convenient manipulation of the different objects related to a given dataset. These include the original data, the reconstructed landscape and the visualization coordinates

Parameters:

dataset_name: str: Name of the dataset to load from the built-in list. If data or landscape are provided, it will be the name given to the new dataset
data: pd.DataFrame of shape (n_obs, n_features): Dataframe containing the experimental data using genotypes as index
landscape: pd.DataFrame of shape (n_genotypes, 1): Dataframe containing the complete combinatorial landscape from which to build the remaining objects of the dataset

Attributes:

data
edges
landscape
nodes
relaxation_times

Methods

calc_visualization
plot
save
to_sequence_space