API Reference
Discrete spaces
- class gpmap.space.DiscreteSpace(adjacency_matrix, y=None, state_labels=None)
Class to define an arbitrary discrete space characterized uniquely by the connectivity between its states and, optionally, by a function (e.g. fitness or energy) defined at each state of the discrete space
- Parameters:
- adjacency_matrix: scipy.sparse.csr_matrix of shape (n_states, n_states)
Sparse matrix representing the adjacency relationships between states. The ij’th entry contains a 1 if the states i and j are connected and 0 otherwise
- y: array-like of shape (n_states,)
Quantitative property associated to each state
- state_labels: array-like of shape (n_states,)
State labels in the discrete space
- Attributes:
- n_states: int
Number of states in the discrete space
- state_labels: array-like of shape (n_states,)
State labels in the discrete space
- state_idxs: pd.Series of shape (n_states,)
pd.Series containing the index of each state. It has state_labels as index of the Series and can be used to quickly extract the index corresponding to a set of state labels
- is_regular: bool
Boolean variable storing whether the resulting graph is regular, that is, whether each node has the same number of neighbors
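To make the adjacency and regularity concepts above concrete, here is a minimal sketch in plain Python that does not use gpmap. It stores the adjacency of a 4-state cycle as nested lists and checks whether every state has the same number of neighbors. The names `is_regular_graph`, `adjacency` and the dictionary-based `state_idxs` are hypothetical stand-ins for illustration, not the library's implementation.

```python
# Adjacency of a 4-state cycle 0-1-2-3-0: entry [i][j] is 1 if i and j are connected
adjacency = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]


def is_regular_graph(adjacency):
    """Return True if every state has the same number of neighbors."""
    degrees = {sum(row) for row in adjacency}
    return len(degrees) == 1


# Hypothetical labels and a label -> index lookup, similar in spirit to state_idxs
state_labels = ["A", "B", "C", "D"]
state_idxs = {label: i for i, label in enumerate(state_labels)}

print(is_regular_graph(adjacency))          # True: every state has 2 neighbors
print([state_idxs[s] for s in ["C", "A"]])  # [2, 0]
```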
Methods
get_neighbor_pairs(): Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
get_neighbors(states[, max_distance]): Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
get_state_idxs(states): Returns the indexes for the provided state labels
get_edges_df
- get_neighbor_pairs()
Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
- get_neighbors(states, max_distance=1)
Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
- Parameters:
- states: array-like of shape (state_number,)
np.array or list of states from which to select the neighbors
- max_distance: int (1)
The maximal distance at which neighbors from the provided states will be returned
- Returns:
- neighbor_states: np.array
Array containing the state labels in the d-neighborhood of `states`
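The idea of a d-neighborhood can be sketched with a breadth-first search over an adjacency list. This is an illustrative plain-Python version, not the library's code: the function name `neighbors_within` and the dictionary-based graph are assumptions, and this sketch includes the starting states in the result, which may or may not match the behavior of `get_neighbors`.

```python
from collections import deque


def neighbors_within(adjacency, start_states, max_distance=1):
    """Return the sorted states at graph distance <= max_distance from any start state."""
    distance = {s: 0 for s in start_states}
    queue = deque(start_states)
    while queue:
        state = queue.popleft()
        if distance[state] == max_distance:
            continue  # do not expand beyond the requested distance
        for neighbor in adjacency[state]:
            if neighbor not in distance:
                distance[neighbor] = distance[state] + 1
                queue.append(neighbor)
    return sorted(distance)


# Path graph a - b - c - d - e as an adjacency list
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(neighbors_within(adjacency, ["a"], max_distance=1))  # ['a', 'b']
print(neighbors_within(adjacency, ["a"], max_distance=2))  # ['a', 'b', 'c']
```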
- get_state_idxs(states)
Returns the indexes for the provided state labels
- class gpmap.space.ProductSpace(elementary_graphs, y=None, state_labels=None)
General class for spaces that can be built as Cartesian products of smaller subspaces characterized by a set of elementary graphs
- Parameters:
- elementary_graphs: list of csr_matrix
List of csr_matrix objects with the adjacency matrices from which to build the product space
- y: None or array-like of shape (n,)
np.array containing the phenotypic values associated to each combination of states in the resulting space. If y=None, no phenotypic values will be stored
- state_labels: None or list
List with the labels associated with each of the possible states in each of the elements of the product space. If state_labels=None, numeric labels will be given by default.
- Attributes:
- is_regular
Attribute characterizing whether the space is regular, that is, whether every node has the same number of neighbors
- n_edges
Methods
get_neighbor_pairs(): Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
get_neighbors(states[, max_distance]): Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
get_state_idxs(states): Returns the indexes for the provided state labels
calc_adjacency_matrix
calc_states
format_list_ends
format_values
get_edges_df
get_y
init_space
set_dim_sizes
set_y
write_csv
write_edges
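A Cartesian product of graphs, as used by ProductSpace, can be illustrated without the library: states are tuples with one element per elementary graph, and two states are neighbors when they differ in exactly one element and that element changes along an edge of the corresponding elementary graph. The sketch below uses adjacency lists rather than csr matrices, and the function name `cartesian_product_graph` is hypothetical.

```python
from itertools import product


def cartesian_product_graph(elementary_graphs):
    """elementary_graphs: list of adjacency lists (dict: state -> list of neighbors).
    Returns the product states (tuples) and the adjacency list of the product graph."""
    states = list(product(*[sorted(g) for g in elementary_graphs]))
    adjacency = {}
    for state in states:
        neighbors = []
        for pos, graph in enumerate(elementary_graphs):
            # Move along an edge of one elementary graph, keeping the rest fixed
            for other in graph[state[pos]]:
                neighbors.append(state[:pos] + (other,) + state[pos + 1:])
        adjacency[state] = neighbors
    return states, adjacency


# Two elementary graphs: a single edge 0-1 and a triangle x-y-z
edge = {0: [1], 1: [0]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
states, adjacency = cartesian_product_graph([edge, triangle])
print(len(states))                  # 6 states = 2 * 3
print(sorted(adjacency[(0, "x")]))  # [(0, 'y'), (0, 'z'), (1, 'x')]
```

Because both elementary graphs in this example are regular, the product graph is regular as well: every state has 1 + 2 = 3 neighbors.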
- class gpmap.space.GridSpace(length, y=None, ndim=2)
Class for creating an N-dimensional grid discrete space
- Parameters:
- length: int or array-like
Number of states across each dimension of the grid. If an integer is provided, all dimensions of the grid will have the same length. If a series of lengths is provided, they will be used to form a grid of dimensions with the specified lengths and the ndim argument will be ignored
- ndim: int
Number of dimensions of the grid. It is only used when a single integer length is provided.
- y: array-like of shape (length ** ndim,) or None
Phenotypic values associated to each possible state
Methods
set_peaks
- class gpmap.space.SequenceSpace(X=None, y=None, seq_length=None, n_alleles=None, alphabet_type='dna', alphabet=None, stop_y=None)
Class for creating a sequence space, a discrete space whose states are sequences. States are connected if they differ at a single position in the sequence. It can be created in two different ways:
- From a set of sequences and function values X, y
- By specifying the properties of the sequence space (alphabet, sequence length, number of alleles per site and type of alphabet)
- Parameters:
- X: array-like of shape (n_genotypes,)
Sequences to use as state labels of the discrete sequence space
- y: array-like of shape (n_genotypes,)
Quantitative phenotype or fitness associated to each genotype
- seq_length: int (None)
Length of the sequences in the sequence space. If not given, it will be guessed from alphabet or n_alleles
- n_alleles: list of size `seq_length` (None)
List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type='custom'
- alphabet_type: str (‘dna’)
Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}
- alphabet: list of `seq_length` lists
Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.
- stop_y: float (None)
Value of the function assigned to protein sequences with an in-frame stop codon. If given, the protein alphabet is extended to include * for stops
- Attributes:
- n_genotypes: int
Number of states in the complete sequence space
- genotypes: array-like of shape (n_genotypes, )
Genotype labels in the sequence space
- adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)
Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise
- y: array-like of shape (n_genotypes,)
Quantitative phenotype or fitness associated to each genotype
- is_regular: bool
Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles
Methods
get_single_mutant_matrix(sequence[, center]): Returns the effects of single point mutations from a focal sequence
remove_codon_incompatible_transitions([codon_table]): Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table
to_nucleotide_space([codon_table, alphabet_type]): Transforms a protein space into a nucleotide space using a codon table for translating the sequence
- get_single_mutant_matrix(sequence, center=False)
Returns the effects of single point mutations from a focal sequence
- Parameters:
- sequence: str
String encoding the sequence from which to report all single point mutant effects
- center: bool (False)
If True, results are centered by position, so that the mean of the allelic effects at each position is 0. If False, the alleles of the focal sequence have value 0 and all other values represent mutational effects relative to it
- Returns:
- output: pd.DataFrame of shape (seq_length, total_alleles)
pd.DataFrame containing the mutational or allelic effects for each allele across all sequence positions
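The effect of the `center` option can be illustrated with a small plain-Python sketch. It takes hypothetical mutational effects relative to a focal sequence (focal alleles have effect 0) and subtracts the per-position mean so that allelic effects average to 0 at each position. The function name `center_by_position` and the nested-dictionary layout are assumptions for illustration; the library returns a pd.DataFrame.

```python
def center_by_position(effects):
    """effects: list of dicts, one per position, mapping allele -> effect.
    Returns a new list in which the effects at each position have mean 0."""
    centered = []
    for site in effects:
        mean = sum(site.values()) / len(site)
        centered.append({allele: value - mean for allele, value in site.items()})
    return centered


# Hypothetical mutational effects relative to the focal sequence "AC"
effects = [
    {"A": 0.0, "C": 1.0, "G": 2.0, "T": 1.0},   # position 0, focal allele A
    {"A": -2.0, "C": 0.0, "G": 0.0, "T": 2.0},  # position 1, focal allele C
]
centered = center_by_position(effects)
print(centered[0])  # {'A': -1.0, 'C': 0.0, 'G': 1.0, 'T': 0.0}
```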
- remove_codon_incompatible_transitions(codon_table='Standard')
Recalculates the adjacency matrix of the discrete space to only allow transitions that are compatible with the specified codon table
- Parameters:
- codon_table: str or Bio.Data.CodonTable
NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein
- to_nucleotide_space(codon_table='Standard', alphabet_type='dna')
Transforms a protein space into a nucleotide space using a codon table for translating the sequence
- Parameters:
- codon_table: str or Bio.Data.CodonTable
NCBI code for an existing genetic code or a custom CodonTable object to translate nucleotide sequences into protein
- alphabet_type: str (‘dna’)
Sequence type to use in the resulting nucleotide space. It can only take one of the following values: {‘dna’, ‘rna’}
- Returns:
- SequenceSpace
Nucleotide sequence space with 4 alleles per site and 3 times the number of sites of the current space
- class gpmap.space.HammingBallSpace(X0, X=None, y=None, d=None, n_alleles=None, alphabet_type='dna', alphabet=None)
Class for the space representing the Hamming ball around a target sequence up to a certain number of mutations from it.
- Parameters:
- X0: str
Focal sequence around which to build the Hamming ball space
- X: array-like of shape (n_genotypes,)
Sequences to use as state labels of the discrete sequence space
- y: array-like of shape (n_genotypes,)
Quantitative phenotype or fitness associated to each genotype
- d: int (None)
Maximum distance from the focal sequence to include in the space
- n_alleles: list of size `seq_length` (None)
List containing the number of alleles present in each of the sites of the sequence space. It can only be specified for alphabet_type='custom'
- alphabet_type: str (‘dna’)
Sequence type: {‘dna’, ‘rna’, ‘protein’, ‘custom’}
- alphabet: list of `seq_length` lists
Every element of the list is itself a list containing the different alleles allowed in each site. Note that the number and type of alleles can be different for every site.
- Attributes:
- n_genotypes: int
Number of states in the complete sequence space
- genotypes: array-like of shape (n_genotypes, )
Genotype labels in the sequence space
- adjacency_matrix: scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)
Sparse matrix representing the adjacency relationships between genotypes. The ij’th entry contains a 1 if the genotypes i and j are separated by a single mutation and 0 otherwise
- y: array-like of shape (n_genotypes,)
Quantitative phenotype or fitness associated to each genotype
- is_regular: bool
Boolean variable storing whether the resulting Hamming graph is regular or not. In other words, whether every site has the same number of alleles
Methods
get_neighbor_pairs(): Returns a tuple with two arrays of indexes corresponding to the states that are connected to each other in the DiscreteSpace
get_neighbors(states[, max_distance]): Returns the unique state labels corresponding to the d-neighbors of the provided states, where the distance is specified by max_distance
get_state_idxs(states): Returns the indexes for the provided state labels
calc_adjacency_matrix
calc_graph
calc_max_min_path
calc_n_paths
format_list_ends
format_values
get_edges_df
get_genotypes
get_y
init_space
set_alphabet_type
set_seq_length
set_y
write_csv
write_edges
Random walks
- class gpmap.randwalk.WMWalk(space, log=None, Ns=None)
Class for Weak Mutation Weak Selection Random Walk on a SequenceSpace. It is a time-reversible continuous-time Markov chain where the transition rates depend on the differences in fitness between two states scaled by the effective population size Ns.
- Attributes:
- space: DiscreteSpace
Space on which the random walk takes place
- Ns: float
Scaled effective population size for the evolutionary model
- rate_matrix: csr_matrix
Rate matrix defining the continuous time process
Methods
set_Ns():
Method to specify the scaled effective population size Ns, either directly or by specifying the mean function at stationarity or the percentile it represents from the distribution of functions across sequence space
calc_stationary_frequencies():
Calculates the stationary frequencies of the states under the random walk specified on the discrete space
calc_rate_matrix():
Calculates the rate matrix for the continuous time process given the scaled effective population size (Ns) or average phenotype at stationarity.
- calc_neutral_mixing_rates(site_exchange_rates, neutral_site_freqs)
Calculates the neutral mixing rates for a SequenceSpace. If no GTR mutation model is specified, the neutral mixing rate is limited by the site with the smallest number of alleles. Otherwise, as mutations are assumed to be site-independent, the neutral mixing rate is limited by the slowest site, given by the smallest of the second eigenvalues of the site rate matrices
- Parameters:
- neutral_site_Qs: list of array-like of shape (n_alleles, n_alleles)
List containing site-specific rate matrices to use for calculating the limiting mixing in the neutral case. If not provided, uniform mutation rates are assumed.
- neutral_site_freqs: list of array-like of shape (n_alleles,)
List containing vectors with the stationary frequencies under neutrality for each site. They are used to calculate the eigenvalues of the time reversible site specific neutral chain. By default, they are assumed to be uniform across sites and alleles.
- site_weights: array-like of shape (seq_length,)
Vector containing the relative weight associated to each site. This value is used to scale the individually normalized rates matrices to ensure this specific leaving rate. By default, all weights are equal
- Returns:
- neutral_mixing_rate: float
Neutral mixing rate as the smallest second largest eigenvalue across sites.
- TODO: Re-implement functionality
- calc_rate_matrix(Ns=None, neutral_stat_freqs=None, neutral_exchange_rates=None)
Calculates the rate matrix for the random walk in the discrete space and stores it in the attribute rate_matrix
- Parameters:
- Ns: real
Scaled effective population size for the evolutionary model
- neutral_stat_freqs: array-like of shape (n_states,)
Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics
- neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states)
Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
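As an illustration of how a time-reversible rate matrix can depend on fitness differences scaled by Ns, the sketch below uses the rate S / (1 - exp(-S)) with S = Ns * (f_j - f_i) under uniform mutation. This functional form is an assumption taken from common weak mutation models; this reference does not state the exact expression the library uses. The names `wm_rate` and `rate_matrix` are hypothetical, and a dense nested list replaces the sparse csr_matrix for simplicity.

```python
import math


def wm_rate(delta_f, Ns):
    """Rate of i -> j for delta_f = f_j - f_i, under the ASSUMED form
    S / (1 - exp(-S)) with S = Ns * delta_f (the limit is 1 when S == 0)."""
    S = Ns * delta_f
    if abs(S) < 1e-12:
        return 1.0
    return S / (1.0 - math.exp(-S))


def rate_matrix(neighbors, f, Ns):
    """Dense rate matrix as nested lists; each row sums to zero."""
    n = len(f)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            Q[i][j] = wm_rate(f[j] - f[i], Ns)
        Q[i][i] = -sum(Q[i])  # diagonal is minus the total leaving rate
    return Q


# Three states on a path 0 - 1 - 2 with hypothetical function values
neighbors = {0: [1], 1: [0, 2], 2: [1]}
f = [0.0, 1.0, 3.0]
Ns = 2.0
Q = rate_matrix(neighbors, f, Ns)

# With this rate form, stationary frequencies are proportional to exp(Ns * f)
weights = [math.exp(Ns * fi) for fi in f]
pi = [w / sum(weights) for w in weights]

# Detailed balance (time reversibility): pi_i * Q_ij == pi_j * Q_ji
print(math.isclose(pi[0] * Q[0][1], pi[1] * Q[1][0]))  # True
```

The detailed balance check follows from the ratio Q_ij / Q_ji = exp(S) under the assumed rate form, which matches the ratio of the stationary weights.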
- calc_visualization(Ns=None, mean_function=None, mean_function_perc=None, n_components=10, neutral_exchange_rates=None, neutral_stat_freqs=None, tol=1e-12)
Calculates the state coordinates to use for visualization of the provided discrete space under a given time-reversible random walk. The coordinates consist of the right eigenvectors of the associated rate matrix Q, re-scaled by the corresponding quantity so that the embedding is in units of square root of time
- Parameters:
- Ns: float
Scaled effective population size to use in the underlying evolutionary model
- mean_function: float
Mean function at stationarity to derive the associated Ns
- mean_function_perc: float
Percentile that the mean function at stationarity takes within the distribution of function values across sequence space, e.g. if mean_function_perc=98, then the mean function at stationarity is set to the 98th percentile of all the function values
- n_components: int (10)
Number of eigenvectors or diffusion axes to calculate
- neutral_stat_freqs: array-like of shape (n_states,)
Genotype stationary frequencies at neutrality to define the time reversible neutral dynamics
- neutral_exchange_rates: scipy.sparse.csr.csr_matrix of shape (n_states, n_states)
Sparse matrix containing the neutral exchange rates for the whole sequence space. If not provided, uniform mutational dynamics are assumed.
- write_tables(prefix, write_edges=False, nodes_format='parquet', edges_format='npz')
Writes the output of the visualization to tables with a common prefix. The output consists of 2 or 3 tables, since the edges table does not need to be stored every time:
- nodes coordinates: contains the coordinates for each state and the associated function values and stationary frequencies. It is stored in CSV format with suffix “nodes.csv” or in parquet format with suffix “nodes.pq”
- decay rates: contains the decay rates and relaxation times associated with each component or diffusion axis. It is stored in CSV format with suffix “decay_rates.csv”
- edges: contains the adjacency relationships between states. It is not stored unless write_edges=True, as it remains unchanged for any visualization on the same SequenceSpace and therefore only needs to be stored once. It can be stored in CSV format or in the more efficient npz format for sparse matrices
- Parameters:
- prefix: str
Prefix of the files to store the different tables
- write_edges: bool (False)
Option to also write the adjacency relationships between pairs of states for plotting the edges
- nodes_format: str {‘parquet’, ‘csv’}
Format to store the nodes information. parquet is more efficient but CSV can be used in smaller cases for plain text storage.
- edges_format: str {‘npz’, ‘csv’}
Format to store the edges information. npz is more efficient but CSV can be used in smaller cases for plain text storage.
Landscape Inference
- class gpmap.inference.MinimumEpistasisInterpolator(P=2, n_alleles=None, seq_length=None, alphabet_type='custom', cg_rtol=1e-16)
Methods
predict(): Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes
smooth
- predict()
Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes
- Returns:
- function: pd.DataFrame of shape (n_genotypes, 1)
Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype
- class gpmap.inference.MinimumEpistasisRegression(P, a=None, n_alleles=None, seq_length=None, alphabet_type='custom', nfolds=5, num_reg=20, min_log_reg=-2, max_log_reg=6, progress=True, cg_rtol=0.0001)
Methods
fit(X, y[, y_var, cross_validation]): Infers the optimal a from the provided data, that is, the magnitude of the Pth order local epistatic coefficients that maximizes predictive performance in held-out data
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
- fit(X, y, y_var=None, cross_validation=False)
Infers the optimal a from the provided data, that is, the magnitude of the Pth order local epistatic coefficients that maximizes predictive performance in held-out data
- Parameters:
- X: array-like of shape (n_obs,)
Vector containing the genotypes for which we have observations provided by y
- y: array-like of shape (n_obs,)
Vector containing the observed phenotypes corresponding to X sequences
- y_var: array-like of shape (n_obs,)
Vector containing the empirical or experimental known variance for the measurements in y
- Returns:
- a: float
Optimal a value maximizing the cross-validated log-likelihood
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
DataFrame containing the summary of the posterior for each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval and the posterior probability for each quantity to be larger or smaller than 0.
- class gpmap.inference.VCregression(lambdas=None, n_alleles=None, seq_length=None, alphabet_type='custom', beta=0, cross_validation=False, nfolds=5, cv_loss_function='frobenius_norm', num_beta=20, min_log_beta=-2, max_log_beta=7, cg_rtol=1e-16, progress=True)
Variance Component regression model that allows inference and prediction of a scalar function in sequence spaces under a Gaussian Process prior parametrized by the contribution of the different orders of interaction to the observed genetic variability of a continuous phenotype
It requires the same number of alleles at every site
Methods
fit(X, y[, y_var]): Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
predict([X_pred, calc_variance]): Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes
simulate([sigma, p_missing]): Simulates data under the specified Variance component priors
lambdas_to_variance
- fit(X, y, y_var=None)
Infers the variance components from the provided data, that is, the relative contribution of the different orders of interaction to the variability in the sequence-function relationships
Stores learned lambdas in the attribute VCregression.lambdas to use internally for predictions and returns them as output
- Parameters:
- X: array-like of shape (n_obs,)
Vector containing the genotypes for which we have observations provided by y
- y: array-like of shape (n_obs,)
Vector containing the observed phenotypes corresponding to X sequences
- y_var: array-like of shape (n_obs,)
Vector containing the empirical or experimental known variance for the measurements in y
- Returns:
- lambdas: array-like of shape (seq_length + 1,)
Variances for each order of interaction k inferred from the data
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
DataFrame containing the summary of the posterior for each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval and the posterior probability for each quantity to be larger or smaller than 0.
- predict(X_pred=None, calc_variance=False)
Compute the Maximum a Posteriori (MAP) estimate of the phenotype at the provided or all genotypes
- Parameters:
- X_pred: array-like of shape (n_genotypes,)
Vector containing the genotypes for which we want to predict the phenotype. If X_pred=None, predictions are provided for the whole sequence space
- calc_variance: bool (False)
Option to also return the posterior variances for each individual genotype
- Returns:
- function: pd.DataFrame of shape (n_genotypes, 1)
Returns the phenotypic predictions for each input genotype in the column ypred and genotype labels as row names. If calc_variance=True, then it has an additional column with the posterior variances for each genotype
- simulate(sigma=0, p_missing=0)
Simulates data under the specified Variance component priors
- Parameters:
- sigma: real
Standard deviation of the experimental noise additional to the variance components
- p_missing: float between 0 and 1
Probability of randomly missing genotypes in the simulated output data
- Returns:
- data: pd.DataFrame of shape (n_genotypes, 3)
DataFrame with the columns y_true, y and var corresponding to the true function at each genotype, the observed values and the variance of the measurement, respectively, for each sequence or genotype indicated in the DataFrame.index
- class gpmap.inference.SeqDEFT(P, n_alleles=None, seq_length=None, alphabet_type='custom', genotypes=None, a=None, num_reg=20, nfolds=5, lambdas_P_inv=None, a_resolution=0.1, max_a_max=1000000000000.0, fac_max=0.1, fac_min=1e-06, optimization_opts={}, maxiter=10000, gtol=1e-06, ftol=1e-08)
Sequence Density Estimation using Field Theory model that allows inference of a complete sequence probability distribution under a Gaussian Process prior parameterized by variance of local epistatic coefficients of order P
It requires the same number of alleles at every site
- Parameters:
- P: int
Order of the local interaction coefficients that are penalized under the prior, i.e. P=2 penalizes local pairwise interactions across all possible faces of the Hamming graph, while P=3 penalizes local 3-way interactions across all possible cubes.
- a: float (None)
Parameter related to the inverse of the variance of the P-order epistatic coefficients that are being penalized. Larger values induce stronger penalization and approximation to the Maximum-Entropy model of order P-1. If a=None the best a is found through cross-validation
- num_reg: int (20)
Number of a values to evaluate through cross-validation
- nfolds: int (5)
Number of folds to use in the cross-validation procedure
Methods
fit(X[, y, baseline_phi, baseline_X, ...]): Infers the sequence-function relationship under the specified Delta^{(P)} prior
make_contrasts(contrast_matrix): Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
simulate(N[, phi, seed]): Simulates data under the specified a penalization for local P-epistatic coefficients
simulate_phi(): Simulates data under the specified a penalization for local P-epistatic coefficients
- fit(X, y=None, baseline_phi=None, baseline_X=None, positions=None, phylo_correction=False, adjust_freqs=False, allele_freqs=None)
Infers the sequence-function relationship under the specified Delta^{(P)} prior
- Parameters:
- X: array-like of shape (n_obs,)
Vector containing the observed sequences
- y: array-like of shape (n_obs,)
Vector containing the weights for each observed sequence. By default, each sequence takes a weight of 1. These weights can be calculated using phylogenetic correction
- baseline_X: array-like of shape (n_genotypes,)
Vector containing the sequences associated with baseline_phi
- baseline_phi: array-like of shape (n_genotypes,)
Vector containing the baseline_phi to include in the model
- positions: array-like of shape (n_pos,)
If provided, subsequences at these positions in the provided input sequences will be used as input
- phylo_correction: bool (False)
Apply phylogenetic correction using the full length sequences
- adjust_freqs: bool (False)
Whether to correct densities by the expected allele frequencies in the full length sequences
- allele_freqs: dict or codon_table
Dictionary containing the expected frequency of every allele in the set of possible sequences, or the codon table to use to generate expected amino acid frequencies. If None, they will be calculated from the full length observed sequences.
- make_contrasts(contrast_matrix)
Computes the posterior distribution of linear combinations of genotypes under the specific Gaussian Process prior.
- Parameters:
- contrast_matrix: pd.DataFrame of shape (n_genotypes, n_contrasts)
DataFrame containing the linear combinations of genotypes for which to compute the summary of the posterior distribution
- Returns:
- contrasts: pd.DataFrame of shape (n_contrasts, 5)
DataFrame containing the summary of the posterior for each contrast, including the posterior standard deviation, the lower and upper bounds of the 95% credible interval and the posterior probability for each quantity to be larger or smaller than 0.
- simulate(N, phi=None, seed=None)
Simulates data under the specified a penalization for local P-epistatic coefficients
- Parameters:
- N: int
Number of total sequences to sample
- phi: array-like of shape (n_genotypes,)
Vector containing values for the field underlying the probability distribution from which to sample sequences. If provided, they will be used instead of sampling them from the prior characterized by the given a.
- seed: int (None)
Random seed to use for simulation
- Returns:
- X: array-like of shape (N,)
Vector containing the sampled sequences from the probability distribution
- simulate_phi()
Simulates data under the specified a penalization for local P-epistatic coefficients
- Returns:
- phi: array-like of shape (n_genotypes,)
Vector containing values for the latent phenotype or field sampled from the prior characterized by a
Sequence utils
- gpmap.seq.guess_space_configuration(seqs, ensure_full_space=True, force_regular=False, force_regular_alleles=False)
Guesses the sequence space configuration from a collection of sequences. It allows a different number of alleles per site and maintains the order in which alleles appear in the sequences when enumerating the alleles per position
- Parameters:
- seqs: array-like of shape (n_genotypes,)
Vector or list containing the sequences from which we want to infer the space configuration
- ensure_full_space: bool
Option to ensure that the whole sequence space must be represented by the set of provided sequences. This is a useful feature to identify whether there are missing genotypes before defining the space and random walk to visualize the full landscape.
- force_regular: bool
Option to ensure that there is the same number of alleles per site. New allele names will be added to sites with fewer than the maximum number of alleles across sites
- force_regular_alleles: bool
Option to additionally ensure that the same alleles are common across all sites
- Returns:
- config: dict with keys {‘length’, ‘n_alleles’, ‘alphabet’}
Returns a dictionary with the inferred configuration of the discrete space where the sequences come from.
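The kind of inference described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it collects the alleles seen at each position in order of first appearance and checks whether the provided sequences cover the complete combinatorial space. The function names `guess_configuration` and `covers_full_space` are hypothetical, and the force_regular options are not covered.

```python
def guess_configuration(seqs):
    """Infer sequence length, alleles per site and per-site alphabet,
    keeping alleles in the order in which they first appear."""
    lengths = {len(seq) for seq in seqs}
    if len(lengths) != 1:
        raise ValueError("All sequences must have the same length")
    length = lengths.pop()
    alphabet = [[] for _ in range(length)]
    for seq in seqs:
        for pos, allele in enumerate(seq):
            if allele not in alphabet[pos]:
                alphabet[pos].append(allele)
    n_alleles = [len(site) for site in alphabet]
    return {"length": length, "n_alleles": n_alleles, "alphabet": alphabet}


def covers_full_space(seqs, config):
    """True if the unique sequences fill the complete combinatorial space."""
    expected = 1
    for n in config["n_alleles"]:
        expected *= n
    return len(set(seqs)) == expected


seqs = ["AB", "AC", "GB", "GC", "TB", "TC"]
config = guess_configuration(seqs)
print(config)
# {'length': 2, 'n_alleles': [3, 2], 'alphabet': [['A', 'G', 'T'], ['B', 'C']]}
print(covers_full_space(seqs, config))  # True
```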
- gpmap.seq.get_custom_codon_table(aa_mapping)
Builds a biopython CodonTable to use for translation with a custom genetic code
- Parameters:
- aa_mapping: pd.DataFrame
pandas DataFrame with columns “Codon” and “Letter” representing the genetic code correspondence. Stop codons should appear as “*”
- Returns:
- codon_table: Bio.Data.CodonTable.CodonTable object
Standard Biopython codon table object to use for translating sequences
- gpmap.seq.get_one_hot_from_alleles(alphabet)
Returns a one-hot encoding CSR matrix for a complete combinatorial space. It uses a fast recursive method to avoid repeatedly building blocks that are common across the full matrix
- Parameters:
- alphabet: list of list
List containing lists of alleles per site in a sequence space
- Returns:
- scipy.sparse.csr_matrix of shape (n_genotypes, total_n_alleles)
csr matrix containing the one-hot encoding of the full sequence space, with genotypes sorted lexicographically
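To show what a one-hot encoding of a complete combinatorial space looks like, the sketch below enumerates all genotypes with `itertools.product` and fills one column per allele and site. It is a direct enumeration in plain Python with dense lists, not the recursive sparse construction described above, and the function name `one_hot_full_space` is hypothetical.

```python
from itertools import product


def one_hot_full_space(alphabet):
    """alphabet: list of per-site allele lists.
    Returns (genotypes, rows); each row is a 0/1 list of length total_n_alleles."""
    # Column offset of each site's block of alleles
    offsets, total = [], 0
    for site in alphabet:
        offsets.append(total)
        total += len(site)
    genotypes, rows = [], []
    for combo in product(*alphabet):  # last site varies fastest
        row = [0] * total
        for pos, allele in enumerate(combo):
            row[offsets[pos] + alphabet[pos].index(allele)] = 1
        genotypes.append("".join(combo))
        rows.append(row)
    return genotypes, rows


genotypes, rows = one_hot_full_space([["A", "B"], ["C", "D", "E"]])
print(genotypes)  # ['AC', 'AD', 'AE', 'BC', 'BD', 'BE']
print(rows[1])    # [1, 0, 0, 1, 0]
```

Each row contains exactly one 1 per site, so every row sums to the sequence length.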
- gpmap.seq.get_alphabet(n_alleles=None, alphabet_type=None)
Returns the resulting alphabet from specifying either the type or the number of alleles per site
- Parameters:
- n_alleles: int
Number of alleles per site
- alphabet_type: str
Type of alphabet to use out of {None, ‘dna’, ‘rna’, ‘protein’}
- Returns:
- alphabet: list
List containing the alleles in the desired alphabet
- gpmap.seq.generate_freq_reduced_code(seqs, n_alleles, counts=None, keep_allele_names=True, last_character='X')
Returns a list of dictionaries with the mapping from each allele in the observed sequences to a reduced alphabet with at most n_alleles per site. The least frequent alleles are pooled together into a single allele
- Parameters:
- seqs: array-like of shape (n_genotypes,) or (n_obs,)
Observed sequences. If counts=None, then every sequence is counted once. Otherwise, frequencies are calculated using the counts as the number of times a certain sequence appears in the data
- n_alleles: int or array-like of shape (seq_length, )
Maximal number of alleles per site allowed. If a list or array is provided, each site will use the specified number of alleles. Otherwise, all sites will have the same maximum number of alleles
- counts: None or array-like of shape (n_genotypes, )
Number of times every sequence in seqs appears in the data. If not provided, every provided sequence is assumed to appear exactly once
- keep_allele_names: bool
If keep_allele_names=True, then allele names are preserved. Otherwise they are replaced by new alleles taken from the alphabet
- last_character: str
Character to use for remaining alleles when keep_allele_names=True
- Returns:
- codelist of dict of length seq_length
List of dictionaries containing the new allele corresponding to each of the original alleles for each site.
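The pooling rule described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's code: the helper name freq_reduced_code is hypothetical, only an integer n_alleles is handled, allele names are always preserved, the pooled allele is assumed to count towards the n_alleles limit, and ties in frequency are broken alphabetically.

```python
from collections import Counter


def freq_reduced_code(seqs, n_alleles, counts=None, last_character="X"):
    """Map alleles at each site to a reduced alphabet (illustrative sketch)."""
    if counts is None:
        counts = [1] * len(seqs)  # every sequence observed once

    code = []
    for pos in range(len(seqs[0])):
        # Frequency of each allele at this site, weighted by sequence counts
        freqs = Counter()
        for seq, c in zip(seqs, counts):
            freqs[seq[pos]] += c

        # Most frequent alleles first; ties broken alphabetically (assumption)
        ranked = sorted(freqs, key=lambda a: (-freqs[a], a))
        if len(ranked) <= n_alleles:
            site_code = {a: a for a in ranked}
        else:
            # Keep n_alleles - 1 names and pool the rest into last_character
            kept = set(ranked[: n_alleles - 1])
            site_code = {a: (a if a in kept else last_character) for a in ranked}
        code.append(site_code)
    return code


code = freq_reduced_code(["AA", "AC", "AG", "CA", "GT"], n_alleles=2)
```

In this example only the most frequent allele at each site keeps its name, and all others are mapped to the pooled character.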
- gpmap.seq.transcribe_seqs(seqs, code)
- gpmap.seq.translate_seqs(seqs, codon_table='Standard')
- gpmap.seq.msa_to_counts(X, y=None, positions=None, phylo_correction=False, max_dist=0.2)
Obtains a series of sequences and their counts from a Multiple Sequence Alignment (MSA) provided as a list of sequences. It can select subsequences by selecting which positions to look at in the MSA and do sequence identity re-weighting by considering the sequence similarities across the full length sequence
- Parameters:
- X: array-like of aligned sequences
Input sequences from which to extract counts
- y: array-like of weights (None)
Pre-calculated weights associated to the input sequences
- positions: array-like of positions (None)
If provided, subsequences at this subset of positions will be used to provide counts or re-weighted counts
- phylo_correction: bool (False)
If True, observations will be re-weighted using sequence similarity along the whole sequence as 1 over the number of similar sequences in the MSA. Similar sequences are defined as those that differ from each other by less than the specified max_dist
- max_dist: float (0.2)
Pairs of sequences that differ by less than this value will be considered similar for re-weighting
- Returns:
- X: np.array of shape (n_unique_seqs, )
Unique subsequences at the specified positions in the MSA
- y: np.array of shape (n_unique_seqs, )
Counts or re-weighted counts for each of the unique subsequences in the MSA
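The re-weighting scheme can be sketched in pure Python as below. The sketch makes assumptions that the text above does not confirm: distance is taken as the fraction of mismatching positions over the full aligned sequence, each sequence counts itself among its similar sequences, and the helper name msa_to_counts_sketch is hypothetical.

```python
from collections import defaultdict


def msa_to_counts_sketch(X, positions=None, phylo_correction=False, max_dist=0.2):
    """Counts (or re-weighted counts) of subsequences in an MSA (illustration)."""
    if phylo_correction:
        weights = []
        for s in X:
            # Number of sequences closer than max_dist, including s itself
            n_similar = sum(
                sum(a != b for a, b in zip(s, t)) / len(s) < max_dist for t in X
            )
            weights.append(1 / n_similar)
    else:
        weights = [1.0] * len(X)

    # Accumulate the weights of the subsequences at the selected positions
    totals = defaultdict(float)
    for s, w in zip(X, weights):
        sub = s if positions is None else "".join(s[p] for p in positions)
        totals[sub] += w

    seqs = sorted(totals)
    return seqs, [totals[s] for s in seqs]


msa = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
raw_seqs, raw_counts = msa_to_counts_sketch(msa, positions=[0])
rw_seqs, rw_counts = msa_to_counts_sketch(msa, positions=[0], phylo_correction=True)
```

The first two sequences differ at 10% of positions, so with max_dist=0.2 each receives a weight of 1/2 and together they contribute a single effective observation.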
Genotypes handling
- gpmap.utils.read_dataframe(fpath)
- gpmap.utils.read_edges(fpath, log=None, return_df=True)
Reads the incidence matrix containing the adjacency information among genotypes from a sequence space
- Parameters:
- fpath: str
File path containing the edges of a sequence space. The extension will be used to differentiate between csv and the more efficient npz format
- return_df: bool (True)
Whether to return a pd.DataFrame with the edges. Otherwise, a csr_matrix is returned
- Returns:
- edges_df: pd.DataFrame of shape (n_edges, 2) or csr_matrix
DataFrame with column names i and j containing the indexes of the genotypes that are separated by a single mutation in a sequence space
- gpmap.genotypes.select_genotypes(nodes_df, genotypes, edges=None, is_idx=False)
Selects the provided genotypes from nodes_df with the corresponding edges among the remaining genotypes if edges are provided
- Parameters:
- nodes_df: pd.DataFrame of shape (n_genotypes, n_features)
DataFrame with the genotypes from a full sequence space as index. Typically, it will contain at least the coordinates of the visualization for each genotype, but any other column in the DataFrame is kept for later use
- genotypes: array-like of shape (n_genotypes,)
Array of ordered genotypes to select from the starting landscape. It should contain the genotype labels by default, or indexes if is_idx=True
- edges: pd.DataFrame of shape (n_edges, 2) or scipy.sparse.csr_matrix of shape (n_genotypes, n_genotypes)
DataFrame or csr_matrix containing the adjacency relationships among the genotypes provided in nodes_df in the discrete space
- is_idx: bool
Whether the genotypes argument is an array of indexes instead of an array of genotype labels
- Returns:
- output: (nodes_df, edges)
Filtered landscape containing the selected genotypes and the adjacency relationships between them given as a tuple
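The idea of filtering nodes together with their edges can be sketched with plain lists. This is an assumption-laden illustration rather than the library's behavior: the helper name select_genotypes_sketch is hypothetical, genotypes are kept in their original order, and edge indexes are renumbered to refer to the filtered set.

```python
def select_genotypes_sketch(labels, edges, selected):
    """Keep the selected genotypes and the edges between them (illustration).

    labels: list of genotype labels; edges: list of (i, j) index pairs.
    """
    keep = set(selected)
    new_labels = [g for g in labels if g in keep]
    # Position of each retained genotype in the filtered landscape
    new_idx = {g: k for k, g in enumerate(new_labels)}
    # An edge survives only if both of its endpoints were selected
    new_edges = [
        (new_idx[labels[i]], new_idx[labels[j]])
        for i, j in edges
        if labels[i] in keep and labels[j] in keep
    ]
    return new_labels, new_edges


labels = ["AA", "AB", "BA", "BB"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
sub_labels, sub_edges = select_genotypes_sketch(labels, edges, ["AA", "AB", "BB"])
```

Dropping genotype BA removes the two edges that touched it and shifts the index of BB from 3 to 2.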
- gpmap.genotypes.get_genotypes_from_region(nodes_df, max_values={}, min_values={})
Returns the genotype labels matching the specified conditions as maximum and minimum values of the dataframe
- Parameters:
- nodes_df: pd.DataFrame of shape (n_genotypes, n_features)
DataFrame with the genotypes from a full sequence space as index. Typically, it will contain at least the coordinates of the visualization for each genotype, but any other column in the DataFrame is kept for later use
- max_values: dict
Dictionary with column names as keys and the maximum values used to filter genotypes as values
- min_values: dict
Dictionary with column names as keys and the minimum values used to filter genotypes as values
- Returns:
- genotypes: array-like of shape (n_selected,)
Array containing the selected genotypes from the input dataframe
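The filtering logic can be sketched with nested dictionaries standing in for the rows of nodes_df. The helper name genotypes_from_region is hypothetical, and the bounds are assumed to be inclusive, which the description above does not specify.

```python
def genotypes_from_region(nodes, max_values=None, min_values=None):
    """Labels of genotypes whose columns fall within the given bounds.

    nodes: dict mapping genotype label -> dict of column name -> value.
    Bounds are treated as inclusive (assumption).
    """
    max_values = max_values or {}
    min_values = min_values or {}
    selected = []
    for genotype, row in nodes.items():
        below = all(row[col] <= v for col, v in max_values.items())
        above = all(row[col] >= v for col, v in min_values.items())
        if below and above:
            selected.append(genotype)
    return selected


nodes = {
    "AA": {"1": -1.0, "function": 0.2},
    "AB": {"1": 0.5, "function": 1.5},
    "BB": {"1": 2.0, "function": 1.8},
}
region = genotypes_from_region(
    nodes, max_values={"1": 1.0}, min_values={"function": 1.0}
)
```

Only AB satisfies both conditions: AA fails the minimum on the function column and BB fails the maximum on coordinate 1.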
- gpmap.genotypes.marginalize_landscape_positions(nodes_df, keep_pos=None, skip_pos=None, return_edges=False)
Averages out some positions in the sequences for all numeric values provided in the input dataframe
- Parameters:
- nodes_df: pd.DataFrame
DataFrame with sequence names as index and at least one numeric column to calculate the average across the selected backgrounds
- keep_pos: array-like (None)
If provided, list of 0-indexed positions that are to be preserved and averaged across all genetic backgrounds specified by the remaining positions
- skip_pos: array-like (None)
If provided, list of 0-indexed positions to average out
- return_edges: bool (False)
Also return an edges_df DataFrame to use directly for visualization
- Returns:
- nodes_df: pd.DataFrame
DataFrame containing the average value of every numeric column in the input DataFrame with the subsequences at the desired positions as index
- edges_df: pd.DataFrame
DataFrame containing the edges of the reduced sequence space. It will only be provided if return_edges=True
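Averaging over genetic backgrounds can be sketched for a single numeric column as follows. The helper name marginalize_positions is hypothetical and the landscape is represented as a plain dictionary from sequence to value instead of a DataFrame.

```python
from collections import defaultdict


def marginalize_positions(landscape, keep_pos):
    """Average values over all backgrounds, keeping only `keep_pos` (sketch)."""
    groups = defaultdict(list)
    for seq, value in landscape.items():
        # Subsequence at the preserved 0-indexed positions
        sub = "".join(seq[p] for p in keep_pos)
        groups[sub].append(value)
    return {sub: sum(vals) / len(vals) for sub, vals in groups.items()}


landscape = {"AA": 1.0, "AB": 3.0, "BA": 2.0, "BB": 6.0}
averaged = marginalize_positions(landscape, keep_pos=[0])
```

Here the second position is averaged out, so each allele at the first position receives the mean value across its two backgrounds.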
Plotting
- gpmap.plot.mpl.plot_relaxation_times(decay_df, axes=None, fpath=None, log_scale=False, neutral_time=None, kwargs={})
Plots the relaxation times associated to each of the components calculated using WMWalk.calc_visualization
- Parameters:
- decay_df: pd.DataFrame of shape (n_components, 3)
pd.DataFrame containing the decay rates and the associated mean relaxation times for each of the calculated components
- axes: matplotlib Axes object (None)
Axes where to plot. If not provided, a new figure will be created automatically for this plot and saved in the path provided by fpath
- fpath: str (None)
File path to store the plot. If fpath=None, the axes argument must be provided for plotting.
- log_scale: bool (False)
Plot the relaxation times in log scale
- neutral_time: float (None)
If provided, an additional horizontal line will be plotted representing the relaxation time associated to the neutral process. This is useful when selecting the number of relevant dimensions to plot
- kwargs: dict
Dictionary of additional keyword arguments provided to axes.plot and axes.scatter, e.g. color.
- gpmap.plot.mpl.plot_edges(axes, nodes_df, edges_df, x='1', y='2', z=None, alpha=0.1, zorder=1, color='grey', cbar=True, cmap='binary', cbar_axes=None, cbar_orientation='vertical', cbar_label='', palette=None, legend=True, legend_loc=0, width=0.5, max_width=1, min_width=0.1, fontsize=None)
Plots the edges representing the connections between states that are connected in the discrete space under a particular embedding
- Parameters:
- axes: matplotlib Axes
Axes in which to plot the edges.
- nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)
pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
- edges_df: pd.DataFrame of shape (n_edges, 2)
pd.DataFrame with the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
- x: str (‘1’)
Column in nodes_df to use for plotting the genotypes on the x-axis
- y: str (‘2’)
Column in nodes_df to use for plotting the genotypes on the y-axis
- z: str (None)
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.
- alpha: float (0.1)
Transparency of lines representing the edges
- zorder: int (1)
Order in which the edges will be rendered relative to other elements. Generally, we would want this to be smaller than the zorder used for plotting the nodes
- color: str (‘grey’)
Column name for the values according to which edges will be colored or the specific color to use for plotting the edges
- cmap: colormap or str
Colormap to use for coloring the edges according to column color
- width: float or str
Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.
- max_width: float (1)
Maximum line width for the edges when widths are scaled by a column in edges_df
- min_width: float (0.1)
Minimum line width for the edges when widths are scaled by a column in edges_df
- Returns:
- line_collection: LineCollection or Line3DCollection
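When width is given as a column name, the values of that column have to be mapped onto the interval between min_width and max_width. The exact mapping is not documented here; the sketch below assumes a linear min-max rescaling, and both the helper name scale_widths and the fallback used for a constant column are hypothetical.

```python
def scale_widths(values, min_width=0.1, max_width=1.0):
    """Linearly rescale values to [min_width, max_width] (assumed mapping)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: fall back to the midpoint of the allowed range
        return [(min_width + max_width) / 2] * len(values)
    span = max_width - min_width
    return [min_width + (v - lo) / (hi - lo) * span for v in values]


widths = scale_widths([0.0, 5.0, 10.0])
```

Under this assumption the smallest value is drawn with min_width, the largest with max_width, and intermediate values fall proportionally in between.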
- gpmap.plot.mpl.plot_nodes(axes, nodes_df, x='1', y='2', z=None, alpha=1, zorder=2, sort_by=None, sort_ascending=False, color='function', cmap='viridis', cbar=True, cbar_axes=None, cbar_label='Function', cbar_orientation='vertical', vcenter=None, vmax=None, vmin=None, palette='Set1', size=2.5, max_size=40, min_size=1, lw=0, edgecolor='black', legend=True, legend_loc=0)
Plots the nodes representing the states of the discrete space on the provided coordinates
- Parameters:
- axes: matplotlib Axes
Axes in which to plot the nodes or states.
- nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)
pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
- x: str (‘1’)
Column in nodes_df to use for plotting the genotypes on the x-axis
- y: str (‘2’)
Column in nodes_df to use for plotting the genotypes on the y-axis
- z: str (None)
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced as long as the provided axes object allows it.
- alpha: float (1)
Transparency of markers representing the nodes
- zorder: int (2)
Order in which the nodes will be rendered relative to other elements. Generally, we would want this to be bigger than the zorder used for plotting the edges
- color: str (‘function’)
Column name for the values according to which states will be colored or the specific color to use for plotting the states
- vcenter: bool (False)
Center the color scale around the 0 value
- vmax: float
Maximum value to show in the colormap
- vmin: float
Minimum value to show in the colormap
- cmap: colormap or str
Colormap to use for coloring the nodes according to column color
- cbar: bool
Boolean variable representing whether to show the colorbar
- cbar_label: str
Label for the colorbar associated to the nodes color scale
- cbar_axes: matplotlib Axes
Axes in which to plot the colorbar. If not provided, it will be automatically adjusted to the current Axes
- palette: dict
Dictionary containing the colors associated to the categories specified by the column color in nodes_df, if they express categories rather than numerical values
- size: float (2.5)
Size of the markers provided for plotting to axes.scatter. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.
- max_size: float (40)
Maximum marker size for the nodes when sizes are scaled by a column in nodes_df
- min_size: float (1)
Minimum marker size for the nodes when sizes are scaled by a column in nodes_df
- lw: float (0)
Width of the line edges delimiting the markers representing the nodes
- edgecolor: str (‘black’)
Color of the line edges delimiting the markers representing the nodes
- legend: bool (True)
Show legend on the plot
- legend_loc: int or tuple
Location of the legend in case of coloring according to a categorical variable
- Returns:
- line_collection: LineCollection or Line3DCollection
- gpmap.plot.mpl.plot_visualization(axes, nodes_df, edges_df=None, x='1', y='2', z=None, nodes_alpha=1, nodes_zorder=2, nodes_color='function', nodes_cmap='viridis', nodes_palette=None, nodes_vmin=None, nodes_vmax=None, nodes_vcenter=False, nodes_cbar=True, nodes_cbar_axes=None, nodes_cmap_label='Function', nodes_size=2.5, nodes_min_size=1, nodes_max_size=40, nodes_lw=0, nodes_edgecolor='black', edges_alpha=0.1, edges_zorder=1, edges_color='grey', edges_cmap='binary', edges_palete=None, edges_cbar=False, edges_cbar_axes=None, edges_width=0.5, edges_max_width=1, edges_min_width=0.1, sort_by=None, sort_ascending=True, center_spines=False, add_hist=False, inset_cbar=False, inset_pos=(0.7, 0.7), prev_nodes_df=None)
Plots the nodes representing the states of the discrete space on the provided coordinates and, if provided, the edges representing the connections between states that are connected
- Parameters:
- axes: matplotlib Axes
Axes in which to plot the nodes and edges.
- nodes_df: pd.DataFrame of shape (n_genotypes, n_variables)
pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
- edges_df: pd.DataFrame of shape (n_edges, 2)
pd.DataFrame with the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
- x: str, optional
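Column in nodes_df to use for plotting the genotypes on the x-axis, by default ‘1’
- y: str, optional
Column in nodes_df to use for plotting the genotypes on the y-axis, by default ‘2’
- z: str, optional
Column in nodes_df to use for plotting the genotypes on the z-axis; if provided, a 3D plot will be produced as long as the provided axes object allows it, by default None
- nodes_alpha: float, optional
Transparency of markers representing the nodes, by default 1
- nodes_zorder: int, optional
Order in which the nodes will be rendered relative to other elements, by default 2
- nodes_color: str, optional
Column name for the values according to which states will be colored or the specific color to use for plotting the states, by default ‘function’
- nodes_cmap: colormap or str, optional
Colormap to use for coloring the nodes according to nodes_color, by default ‘viridis’
- nodes_palette: dict, optional
Dictionary containing the colors associated to the categories specified by nodes_color, if they express categories rather than numerical values, by default None
- nodes_vmin: float, optional
Minimum value to show in the nodes colormap, by default None
- nodes_vmax: float, optional
Maximum value to show in the nodes colormap, by default None
- nodes_vcenter: bool, optional
Center the nodes color scale around the 0 value, by default False
- nodes_cbar: bool, optional
Whether to show the colorbar for the nodes, by default True
- nodes_cbar_axes: matplotlib Axes, optional
Axes in which to plot the nodes colorbar; if not provided, it will be automatically adjusted to the current Axes, by default None
- nodes_cmap_label: str, optional
Label for the colorbar associated to the nodes color scale, by default ‘Function’
- nodes_size: float or str, optional
Size of the markers representing the nodes, or column in nodes_df according to which node sizes will be scaled, by default 2.5
- nodes_min_size: float, optional
Minimum marker size for the nodes when sizes are scaled by a column in nodes_df, by default 1
- nodes_max_size: float, optional
Maximum marker size for the nodes when sizes are scaled by a column in nodes_df, by default 40
- nodes_lw: float, optional
Width of the line edges delimiting the markers representing the nodes, by default 0
- nodes_edgecolor: str, optional
Color of the line edges delimiting the markers representing the nodes, by default ‘black’
- edges_alpha: float, optional
Transparency of lines representing the edges, by default 0.1
- edges_zorder: int, optional
Order in which the edges will be rendered relative to other elements, by default 1
- edges_color: str, optional
Column name for the values according to which edges will be colored or the specific color to use for plotting the edges, by default ‘grey’
- edges_cmap: colormap or str, optional
Colormap to use for coloring the edges according to edges_color, by default ‘binary’
- edges_palete: dict, optional
Dictionary containing the colors associated to the categories specified by edges_color, if they express categories rather than numerical values, by default None
- edges_cbar: bool, optional
Whether to show the colorbar for the edges, by default False
- edges_cbar_axes: matplotlib Axes, optional
Axes in which to plot the edges colorbar, by default None
- edges_width: float or str, optional
Width of the lines representing the edges, or column in edges_df according to which widths will be scaled, by default 0.5
- edges_max_width: float, optional
Maximum line width for the edges when widths are scaled by a column in edges_df, by default 1
- edges_min_width: float, optional
Minimum line width for the edges when widths are scaled by a column in edges_df, by default 0.1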
- sort_by: _type_, optional
_description_, by default None
- sort_ascending: bool, optional
_description_, by default True
- center_spines: bool, optional
_description_, by default False
- add_hist: bool, optional
_description_, by default False
- inset_cbar: bool, optional
_description_, by default False
- inset_pos: tuple, optional
_description_, by default (0.7, 0.7)
- prev_nodes_df: _type_, optional
_description_, by default None
- gpmap.plot.mpl.figure_Ns_grid(rw, x='1', y='2', pmin=0, pmax=0.8, ncol=4, nrow=3, show_edges=True, fpath=None, **kwargs)
- gpmap.plot.mpl.figure_allele_grid(nodes_df, edges_df=None, allele_color='orange', background_color='lightgrey', positions=None, position_labels=None, colsize=3, rowsize=2.7, xpos_label=0.05, ypos_label=0.92, fmt='png', fpath=None, **kwargs)
- gpmap.plot.ply.plot_visualization(nodes_df, edges_df=None, x='1', y='2', z=None, nodes_color='function', nodes_size=4, nodes_cmap='viridis', nodes_cmap_label='Function', edges_width=0.5, edges_color='#888', edges_alpha=0.2, text=None, fpath=None)
Makes an interactive plot of fitness landscape with genotypes as nodes and single point mutations as edges using plotly
- Parameters:
- nodes_df: pd.DataFrame of shape (n_genotypes, n_components + 2)
pd.DataFrame containing the coordinates in each of the n_components in addition to the “function” and “stationary_freq” columns. Additional columns are also allowed
- edges_df: pd.DataFrame of shape (n_edges, 2)
pd.DataFrame with the connectivity information between states of the discrete space to plot. It has columns “i” and “j” for the indexes of the pairs of states that are connected.
- x: str (‘1’)
Column in nodes_df to use for plotting the genotypes on the x-axis
- y: str (‘2’)
Column in nodes_df to use for plotting the genotypes on the y-axis
- z: str (None)
Column in nodes_df to use for plotting the genotypes on the z-axis. If provided, then a 3D plot will be produced.
- nodes_color: str (‘function’)
Column name for the values according to which states will be colored or the specific color to use for plotting the states
- nodes_size: float (4)
Size of the markers representing the nodes. If a float is provided, that will be the size used to plot every node. If str, then node sizes will be scaled according to the corresponding column in nodes_df.
- nodes_cmap: colormap or str
Colormap to use for coloring the nodes according to nodes_color
- nodes_cmap_label: str
Label for colorbar
- edges_width: float or str
Width of the lines representing the edges. If a float is provided, that will be the width used to plot every edge. If str, then widths will be scaled according to the corresponding column in edges_df.
- edges_color: str
Column name for the values according to which edges will be colored or the specific color to use for plotting the edges
- edges_alpha: float (0.2)
Transparency of lines representing the edges
- text: array-like of shape (nodes_df.shape[0]) (None)
Labels to show for each state when hovering over the markers representing them. If not provided, the row names of the nodes_df DataFrame will be used
- fpath: str
File path in which to store the interactive plot as an html file
- gpmap.plot.ds.plot_visualization(nodes_df, x='1', y='2', edges_df=None, nodes_color='function', nodes_cmap='viridis', nodes_size=5, nodes_vmin=None, nodes_vmax=None, linewidth=0, edgecolor='black', sort_by=None, sort_ascending=False, edges_width=0.5, edges_alpha=1, edges_color='grey', edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, shade_nodes=True, shade_edges=True, square=True)
- gpmap.plot.ds.figure_allele_grid(nodes_df, fpath, x='1', y='2', edges_df=None, positions=None, position_labels=None, edges_cmap='grey', background_color='white', nodes_resolution=800, edges_resolution=1200, sort_by=None, sort_ascending=False, fmt='png', figsize=None, square=True, **kwargs)
- gpmap.plot.mpl.plot_SeqDEFT_summary(log_Ls, seq_density=None, err_bars='stderr', show_folds=False, legend_loc=1, normalize_logL=True)
Generates a 2 panel figure showing how the cross-validated likelihood changes with the a hyperparameter and the best selected value for model fitting.
- Parameters:
- log_Ls: pd.DataFrame of shape (num_a, 3)
DataFrame containing the column names a, logL and fold
- seq_density: pd.DataFrame of shape (n_genotypes, >= 2)
DataFrame with column names frequency and Q, with the observed frequencies and estimated densities for each possible sequence respectively. If not provided, only a 1 panel figure with the cross-validated likelihood curve will be produced
- err_bars: str
What to show in the error bars: sd for the standard deviation across the different folds or stderr for the standard error of the mean
- show_folds: bool
Whether to show the out of sample log likelihoods for the different folds in the cross-validation procedure separately
- Returns:
- fig: matplotlib.figure object
Figure object containing the resulting plots
Datasets
- gpmap.datasets.list_available_datasets()
Returns a list with the names of all available built-in datasets
- class gpmap.datasets.DataSet(dataset_name, data=None, landscape=None)
DataSet object that allows convenient manipulation of the different objects related to a given dataset. This includes the original data, the reconstructed landscape and the visualization coordinates
- Parameters:
- dataset_name: str
Name of the dataset to load from the built-in list. If data or landscape are provided, it will be the name given to the new dataset
- data: pd.DataFrame of shape (n_obs, n_features)
Dataframe containing the experimental data using genotypes as index
- landscape: pd.DataFrame of shape (n_genotypes, 1)
Dataframe containing the complete combinatorial landscape from which to build the remaining objects of the dataset
- Attributes:
- data
- edges
- landscape
- nodes
- relaxation_times
Methods
calc_visualization
plot
save
to_sequence_space