Experimental data
In contrast to theoretical or computational landscapes, like the Serine codon model shown in previous sections, empirical landscapes are based on experimental data. Such data is typically noisy and incomplete, so we need to apply additional methods to obtain reliable estimates of the phenotype of every possible genotype in order to visualize the genotype-phenotype map. In this section, we will illustrate the two methods available in gpmap-tools for inference of complete genotype-phenotype
maps \(f\) from noisy measurements y at a subset of sequences X. Optionally, the variance associated to the measurement y_var can be provided.
Note: these methods only work for sequences of the same length
Minimum epistasis interpolation (MEI) finds the genotype-phenotype map that minimizes the sum of squared P-th epistatic coefficients \(\bar{\epsilon_P^2}\) while exactly matching the data
yat sequencesX(read more).Empirical VC regression estimates a prior distribution characterized by the variance explained by interactions of every possible order from the empirical distance-correlation function computed from the available data. Then, it uses this prior distribution to perform Gaussian process regression to infer a complete genotype-phenotype map (read more).
In this section, we will illustrate how to use these methods using simulated and real data and how to use them to evaluate the evidence supporting specific hypotheses about the genotype-phenotype map.
[1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, norm
from gpmap.datasets import DataSet
from gpmap.inference import VCregression, MinimumEpistasisInterpolator
from gpmap.plot.mpl import init_fig
Inference from simulated data
We first illustrate how to use these methods using simulated data. In this case, we know what the true genotype-phenotype map is and can compare the model inferences with the real values.
How to sample functions from the prior distribution
We will simulate data from the Variance Component regression model. In order to do so, we need to define the configuration of sequence space, the variance associated to interactions of every possible order lambdas, standard deviation of the measurement noise sigma and the fraction of genotypes that are not measured in the experiment p_missing.
[2]:
np.random.seed(0)
lambdas_true = np.array([1e4, 1e3, 2e2, 1e0, 1e-1, 3e-3, 1e-5])
model = VCregression(seq_length=6, alphabet_type='dna', lambdas=lambdas_true)
f, X, y, y_var = model.simulate(y_var=0.04, p_missing=0.1)
idx = model.get_obs_idx(X)
X_test = np.delete(model.genotypes, idx)
f_test = np.delete(f, idx)
data = pd.DataFrame({'y': y, 'y_var': y_var, 'f': f[idx]}, index=X)
data
[2]:
| y | y_var | f | |
|---|---|---|---|
| AAAAAA | 0.640632 | 0.04 | 0.496707 |
| AAAAAC | -2.314924 | 0.04 | -2.381059 |
| AAAAAG | -2.150190 | 0.04 | -2.072592 |
| AAAAAT | 1.466791 | 0.04 | 1.143917 |
| AAAACA | 3.854757 | 0.04 | 4.207335 |
| ... | ... | ... | ... |
| TTTTGG | 0.791477 | 0.04 | 0.805718 |
| TTTTGT | 1.621376 | 0.04 | 1.522178 |
| TTTTTA | 3.755850 | 0.04 | 3.650933 |
| TTTTTC | 1.187770 | 0.04 | 1.162799 |
| TTTTTG | 3.084332 | 0.04 | 2.627240 |
3693 rows × 3 columns
How to compute the minimum epistasis interpolation solution
This can be done very simply by defining a MinimumEpistasisInterpolator object with the right configuration of sequence space and the order of local epistatic coefficients we aim to penalize. In this case, we set P=2 to minimize the sum of squared epistatic coefficients.
[3]:
model = MinimumEpistasisInterpolator(P=2, seq_length=6, alphabet_type='dna')
model.set_data(X=X, y=y)
mei = model.predict()
mei
[3]:
| f | |
|---|---|
| AAAAAA | 0.640632 |
| AAAAAC | -2.314924 |
| AAAAAG | -2.150190 |
| AAAAAT | 1.466791 |
| AAAACA | 3.854757 |
| ... | ... |
| TTTTGT | 1.621376 |
| TTTTTA | 3.755850 |
| TTTTTC | 1.187770 |
| TTTTTG | 3.084332 |
| TTTTTT | 3.728994 |
4096 rows × 1 columns
We can now compare the predicted with the true phenotype at the held-out test sequences
[4]:
r2 = pearsonr(mei['f'], f)[0] ** 2
fig, axes = plt.subplots(1, 1, figsize=(4, 4))
axes.scatter(x=mei['f'], y=f, c='black', alpha=0.5, s=5, lw=0)
ticks = np.arange(-10, 9, 2)
axes.axline((0, 0), (1, 1), lw=0.75, linestyle='--', c='grey')
axes.set(xlabel=r'$y_{True}$', ylabel=r'$y_{pred}$', aspect='equal', xlim=(-15, 10), ylim=(-15, 10))
axes.text(0.03, 0.97, r'$R^2$=' + '{:.3f}'.format(r2), transform=axes.transAxes,
va='top')
[4]:
Text(0.03, 0.97, '$R^2$=0.995')
This comparison shows a very high correlation between the predictions and the real phenotypic values, but the extremely low and high values tend to be over and underestimated, respectively.
How to estimate Variance Components \(\lambda\)’s
We next illustrate how to use Variance Component regression on the same simulated data for comparison. The first step is the estimation of the variance components from the data. The covariance matrix \(K\) under the prior is given by
We define the \(\lambda_k\) through kernel alignment. This is achieved by finding the non-negative linear combination of the covariance matrices \(K_k\) associated to each pure interactions of order \(k\) that that minimize the Frobenius norm of the difference with the empirical second moment matrix \((y - \bar y)^T(y - \bar y)\)
[5]:
model = VCregression(seq_length=6, alphabet_type="dna")
model.fit(X, y, y_var)
vc = model.lambdas_to_variance(model.lambdas)
vc_p = vc / vc.sum() * 100
[6]:
k = np.arange(model.lambdas.shape[0])
fig, subplots = plt.subplots(1, 3, figsize=(12, 3.5))
axes = subplots[0]
axes.bar(x=k, height=model.lambdas, color='black')
axes.set(xlabel='k', ylabel=r'$\lambda_k$', yscale='log',
xticks=np.arange(7))
axes = subplots[1]
axes.bar(x=k[1:], height=vc_p, color='black')
axes.set(xlabel='k', ylabel='% variance explained')
axes = subplots[2]
axes.scatter(lambdas_true, model.lambdas, c='black')
lims = (1e-5, 1e6)
axes.plot(lims, lims, c='grey', lw=0.3, alpha=0.5)
axes.set(xscale='log', yscale='log',
xlabel=r'True $\lambda_k$',
ylabel=r'Inferred $\lambda_k$',
xlim=lims, ylim=lims)
fig.tight_layout()
We can see that the inferred variance components are very similar in log scale to the true data-generating parameters.
However, when data is sparse, kernel alignment is a hard problem and we need to regularize towards simpler solutions. While we could shrink towards additivity, another good general choice that still models high order interactions is towards exponential decay of the variance components. We can do it by setting cross_validation=True and penalizing the second order moments of the \(\log \lambda_k\) as a function of \(k\). It will automatically split the data into different folds and
perform kernel alignment under different regularization strength to find the penalization constant that best generalizes to the validation set according to a metric. These metrics can be the Frobenius of the difference between the prior and second moment matrix, \(R^2\) on the or the log-likelihood of the predictions on the validation set. This can be specified through the argument cv_loss_function='r2' ,cv_loss_function='logL'.
How to obtain phenotypic predictions under a VC prior
Once we know the parameters of the Variance Component prior, we can compute the posterior distribution for the complete genotype-phenotype map with the method predict
[7]:
model.fit(X, y, y_var)
pred = model.predict()
pred
[7]:
| f | |
|---|---|
| AAAAAA | 0.509161 |
| AAAAAC | -2.099185 |
| AAAAAG | -2.021162 |
| AAAAAT | 1.261439 |
| AAAACA | 3.904009 |
| ... | ... |
| TTTTGT | 1.799622 |
| TTTTTA | 3.642860 |
| TTTTTC | 1.275835 |
| TTTTTG | 2.718211 |
| TTTTTT | 4.253923 |
4096 rows × 1 columns
We can now compare the predictions with the true phenotypic values in held-out sequences, achieving better predictive performance than with MEI.
[8]:
r2 = pearsonr(pred["f"], f)[0] ** 2
fig, axes = plt.subplots(1, 1, figsize=(4, 4))
axes.scatter(x=pred["f"], y=f, c="black", alpha=0.5, s=5, lw=0)
ticks = np.arange(-10, 9, 2)
axes.axline((0, 0), (1, 1), lw=0.75, linestyle="--", c="grey")
axes.set(
xlabel=r"$y_{True}$",
ylabel=r"$y_{pred}$",
aspect="equal",
xlim=(-15, 10),
ylim=(-15, 10),
)
axes.text(
0.03,
0.97,
r"$R^2$=" + "{:.3f}".format(r2),
transform=axes.transAxes,
va="top",
)
[8]:
Text(0.03, 0.97, '$R^2$=0.998')
How to calculate uncertainty of the predictions
In the previous section, we have shown how to compute the posterior mean of the complete genotype-phenotype given the data and the estimated variance components. However, in cases where the ground truth is not known, it can be useful to have an idea about the uncertainty of the phenotypic predictions for new sequences. As we are using Gaussian process models with Gaussian likelihood function, we can compute, not only the posterior mean, but also the posterior variance and 95% probability
credible interval for these sequences by setting calc_variance=True.
Note: Computing the variance for every possible sequence is usually unfeasible because we need to solve a system of \(~\alpha^\ell\) equations for computing the variance of each sequence. Thus, in practice, this calculation can only be done for a small subset of sequences of interest.
[9]:
pred = model.predict(X_test, calc_variance=True)
pred['f'] = f[model.likelihood.idx[X_test]]
pred
100%|██████████| 403/403 [00:16<00:00, 24.33it/s]
[9]:
| f | f_var | f_std | ci_95_lower | ci_95_upper | |
|---|---|---|---|---|---|
| AAAACG | -1.661166 | 0.022802 | 0.151005 | -2.026398 | -1.422378 |
| AAAACT | 1.739073 | 0.021753 | 0.147488 | 1.361176 | 1.951128 |
| AAAAGA | 1.857192 | 0.021762 | 0.147520 | 1.439720 | 2.029800 |
| AAACGG | -5.395939 | 0.020771 | 0.144121 | -5.487868 | -4.911382 |
| AAAGAT | 1.554774 | 0.022872 | 0.151234 | 1.277205 | 1.882139 |
| ... | ... | ... | ... | ... | ... |
| TTTGCC | -8.340758 | 0.021756 | 0.147500 | -8.462849 | -7.872850 |
| TTTGCG | -4.501158 | 0.021703 | 0.147320 | -4.511932 | -3.922653 |
| TTTGTA | -3.320864 | 0.020772 | 0.144126 | -3.539676 | -2.963171 |
| TTTTCA | 5.393270 | 0.020795 | 0.144206 | 4.917685 | 5.494508 |
| TTTTTT | 4.110144 | 0.025390 | 0.159342 | 3.935240 | 4.572607 |
403 rows × 5 columns
We can now check that predictions are well calibrated, this is, that the true value lies withing the 95% probability predicted interval close to the expected 95% probability; and show the error bars in the scatterplot to visualize if uncertainty is uniformly distributed along the whole range of true phenotypic values.
[10]:
perc = np.mean((f_test > pred['ci_95_lower']) & (f_test < pred['ci_95_upper'])) * 100
fig, axes = plt.subplots(1, 1, figsize=(5, 5))
axes.errorbar(
x=f_test,
y=pred["f"],
yerr=2 * pred["f_std"],
c="black",
markersize=3,
fmt="o",
alpha=0.2,
)
axes.axline((0, 0), (1, 1), lw=0.5, c='grey', linestyle='--')
axes.set(
xlabel=r"$y_{True}$",
ylabel=r"$y_{pred}$",
aspect=1,
xlim=(-15, 10),
ylim=(-15, 10),
)
axes.text(0.03, 0.97, 'Calibration={:.2f}%'.format(perc), transform=axes.transAxes, va='top')
[10]:
Text(0.03, 0.97, 'Calibration=92.06%')
How to estimate mutational effects and epistatic coefficients under the prior
Because the posterior distribution of the sequence function relationship is also a multivariate gaussian,
We can easily compute the posterior for any linear combination represented by \(B\) of a multivariate normal distribution
Without any prior distribution, we can assume that \(\Sigma=D_{\sigma^2}\) and \(\mu=y\), and use the following function to compute the posterior distribution for specific linear combinations given by the contrast matrix \(B\)
[11]:
def make_contrasts(df, contrast_matrix):
f = df['y']
Sigma = np.diag(df['y_var'])
B = contrast_matrix.T.values
mu = B @ f
std = np.sqrt(np.diag(B @ Sigma @ B.T))
posterior = norm(mu, std)
p = posterior.cdf(0.)
p = np.max(np.vstack([p, 1-p]), axis=0)
contrasts = pd.DataFrame({'estimate': mu, 'std': std,
'ci_95_lower': mu - 2 * std,
'ci_95_upper': mu + 2 * std,
'p(|x|>0)': p},
index=contrast_matrix.columns)
return(contrasts)
Lets start by estimating mutational effects fromTTTTTT to TTTTTC:
[12]:
seqs = ['TTTTTA', 'TTTTTC']
df = data.loc[seqs, :]
df
[12]:
| y | y_var | f | |
|---|---|---|---|
| TTTTTA | 3.75585 | 0.04 | 3.650933 |
| TTTTTC | 1.18777 | 0.04 | 1.162799 |
[13]:
contrast_matrix = pd.DataFrame({'A6C': [1, -1]}, index=['TTTTTA', 'TTTTTC'])
contrast_matrix
[13]:
| A6C | |
|---|---|
| TTTTTA | 1 |
| TTTTTC | -1 |
Lets start by computing the true mutational effect from the simulated genotype-phenotype map
[14]:
true_effect = (contrast_matrix.T.values @ df['f'].values)[0]
print('True mutational effect = {:.2f}'.format(true_effect))
True mutational effect = 2.49
and compare it with our naive posterior estimate.
[15]:
contrasts = make_contrasts(df, contrast_matrix)
mu = contrasts.loc['A6C', 'estimate']
lower = contrasts.loc['A6C', 'ci_95_lower']
upper = contrasts.loc['A6C', 'ci_95_upper']
print('Estimated mutational effect = {:.2f} (95% CI = [{:.2f}, {:2f}])'.format(mu, lower, upper))
Estimated mutational effect = 2.57 (95% CI = [2.00, 3.133766])
We can see that the true effect is relatively close to the inferred one, and is within the 95% credible interval. However, the 95% credible interval is quite wide. Thus, can we do better by including taking into account our prior distribution?
We can do the same calculations using the make_contrast method provided any user-defined contrast_matrix
[16]:
contrasts = model.make_contrasts(contrast_matrix=contrast_matrix)
mu = contrasts.loc['A6C', 'estimate']
lower = contrasts.loc['A6C', 'ci_95_lower']
upper = contrasts.loc['A6C', 'ci_95_upper']
print('Estimated mutational effect with GP prior = {:.2f} (95% CI = [{:.2f}, {:2f}])'.format(mu, lower, upper))
100%|██████████| 2/2 [00:00<00:00, 18.83it/s]
Estimated mutational effect with GP prior = 2.37 (95% CI = [2.06, 2.672019])
Thus, we can see how the estimate now is closer to the true simulated mutational effect and that the posterior distribution is tighter around it, illustrating that we have less uncertainty about its effect when taking into account the observed mutational effects across all genetic backgrounds appropriately.
We can do these analyses not only for mutational effects, but also for epistatic coefficients of any order. Lets see how the previous mutational effect changes when we add a T0A mutation at the first position:
[17]:
seqs = ['TTTTTA', 'TTTTTC', 'ATTTTA', 'ATTTTC']
df = data.loc[seqs, :]
df
[17]:
| y | y_var | f | |
|---|---|---|---|
| TTTTTA | 3.755850 | 0.04 | 3.650933 |
| TTTTTC | 1.187770 | 0.04 | 1.162799 |
| ATTTTA | 2.589606 | 0.04 | 2.611931 |
| ATTTTC | 2.904488 | 0.04 | 2.793064 |
Lets do it again first without using our prior distribution
[18]:
contrast_matrix = pd.DataFrame({'epistatic_coeff': [1, -1, -1, 1]},
index=seqs)
contrast_matrix
[18]:
| epistatic_coeff | |
|---|---|
| TTTTTA | 1 |
| TTTTTC | -1 |
| ATTTTA | -1 |
| ATTTTC | 1 |
[19]:
true_effect = (contrast_matrix.T.values @ df['f'].values)[0]
print('True epistatic coefficient = {:.2f}'.format(true_effect))
True epistatic coefficient = 2.67
[20]:
contrasts = make_contrasts(df, contrast_matrix)
mu = contrasts.loc['epistatic_coeff', 'estimate']
lower = contrasts.loc['epistatic_coeff', 'ci_95_lower']
upper = contrasts.loc['epistatic_coeff', 'ci_95_upper']
print('Estimated epistatic coefficient = {:.2f} (95% CI = [{:.2f}, {:2f}])'.format(mu, lower, upper))
Estimated epistatic coefficient = 2.88 (95% CI = [2.08, 3.682962])
[21]:
contrasts = model.make_contrasts(contrast_matrix=contrast_matrix)
mu = contrasts.loc['epistatic_coeff', 'estimate']
lower = contrasts.loc['epistatic_coeff', 'ci_95_lower']
upper = contrasts.loc['epistatic_coeff', 'ci_95_upper']
print('Estimated epistatic coefficient with GP prior = {:.2f} (95% CI = [{:.2f}, {:2f}])'.format(mu, lower, upper))
100%|██████████| 2/2 [00:00<00:00, 14.18it/s]
Estimated epistatic coefficient with GP prior = 2.46 (95% CI = [2.11, 2.806350])
Again, we see how incorporation the prior knowledge about how well mutational effects generalize across genetic backgrounds provides a better estimate of the true epistatic coefficient and a tighter credible interval around it.
Inference from real data: protein GB1
In the next section, we are going to infer the complete combinatorial landscape of the protein G domain B1 from experimental data [Wu et. al (2016)]. In this study, a library containing nearly all possible aminoacid combinations at positions 39, 40, 41 and 54 was selected for binding to the constant fraction of IgG using an mRNA display approach. Thus, by sequencing the mRNA library before and after selection, we can obtain estimates of how strongly each sequence in the library binds its target.
Data preprocessing
We take the number of times a sequence is observed in the input and selected samples as raw data for our purposes as provided by the authors here.
[22]:
gb1 = DataSet('gb1')
counts = gb1.raw_data
counts
[22]:
| input | selected | |
|---|---|---|
| sequence | ||
| VDGV | 92735 | 338346 |
| ADGV | 34 | 43 |
| CDGV | 850 | 641 |
| DDGV | 63 | 63 |
| EDGV | 841 | 190 |
| ... | ... | ... |
| YYYR | 203 | 1 |
| YYYS | 186 | 3 |
| YYYT | 181 | 14 |
| YYYW | 30 | 1 |
| YYYY | 57 | 2 |
149361 rows × 2 columns
We first filter sequences that are poorly represented in the input library (<20 sequencing counts).
[23]:
counts = counts.loc[counts['input'] >= 20, :]
counts
[23]:
| input | selected | |
|---|---|---|
| sequence | ||
| VDGV | 92735 | 338346 |
| ADGV | 34 | 43 |
| CDGV | 850 | 641 |
| DDGV | 63 | 63 |
| EDGV | 841 | 190 |
| ... | ... | ... |
| YYYR | 203 | 1 |
| YYYS | 186 | 3 |
| YYYT | 181 | 14 |
| YYYW | 30 | 1 |
| YYYY | 57 | 2 |
142820 rows × 2 columns
And use Enrich2 approach to derive the enrichment score and the associated error.
[24]:
wt = 'VDGV'
c_wt_input, c_wt_sel = counts.loc[wt]
X = counts.index.values
y = (np.log((counts['selected'] + 0.5)/(c_wt_sel + 0.5)) - np.log((counts['input'] + 0.5)/(c_wt_input + 0.5))).values
y_var = (1/(counts['input'] + 0.5) + 1/(counts['selected'] + 0.5) + 1/(c_wt_sel + 0.5) + 1/(c_wt_input + 0.5)).values
data = pd.DataFrame({'y': y, 'y_var': y_var}, index=X)
To be able to evaluate model performance, we first split the data into training and test sets keeping approximately 90% of the data for training and 10% for testing
[25]:
u = np.random.uniform(size=y.shape[0]) < 0.9
X_train, y_train, y_var_train = X[u], y[u], y_var[u]
X_test, y_test = X[~u], y[~u]
Computing the minumum epistasis interpolation solution
[26]:
model = MinimumEpistasisInterpolator(P=2, seq_length=4, alphabet_type='protein')
model.set_data(X_train, y_train)
mei = model.predict()
mei
[26]:
| f | |
|---|---|
| AAAA | 0.460831 |
| AAAC | -2.478059 |
| AAAD | -3.579738 |
| AAAE | -4.836451 |
| AAAF | -3.947928 |
| ... | ... |
| YYYS | -5.269987 |
| YYYT | -3.821426 |
| YYYV | -3.143536 |
| YYYW | -4.306581 |
| YYYY | -4.429813 |
160000 rows × 1 columns
[27]:
lim = (-9, 3)
r2 = pearsonr(mei.loc[X_test, 'f'], y_test)[0] ** 2
bins = np.linspace(lim[0] + 0.5, lim[1] - 0.5, 100)
fig, axes = plt.subplots(1, 1, figsize=(4, 4))
axes.hist2d(x=mei.loc[X_test, 'f'], y=y_test, cmap="binary", bins=bins)
axes.axline((0, 0), (1, 1), lw=0.5, linestyle="--", c="grey")
axes.set(
xlabel=r"$y_{pred}$",
ylabel=r"$y_{obs}$",
xlim=(lim[0] + 0.5, lim[1] - 0.5),
ylim=(lim[0] + 0.5, lim[1] - 0.5),
)
axes.text(
0.05,
0.95,
r"$R^2$=" + "{:.3f}".format(r2),
transform=axes.transAxes,
color="black",
ha="left",
va="top",
)
[27]:
Text(0.05, 0.95, '$R^2$=0.833')
Estimating Variance Components
[28]:
model = VCregression(seq_length=4, alphabet_type='protein')
model.fit(X=X_train, y=y_train, y_var=y_var_train)
[29]:
lambdas = model.lambdas
x = np.arange(0, lambdas.shape[0])
variance = model.lambdas_to_variance(lambdas)
p_variance = variance / variance.sum() * 100
[30]:
fig, subplots = init_fig(1, 2, colsize=3.5, rowsize=3.5)
axes = subplots[0]
axes.bar(x=x, height=lambdas, color='black')
axes.set(xlabel='$k$', ylabel='$\lambda_k$', yscale='log')
axes = subplots[1]
axes.bar(x=x[1:], height=p_variance, color='black')
axes.set(xlabel='$k$', ylabel='% variance explained', ylim=(0, 100), xticks=[1, 2, 3])
fig.tight_layout()
Inferring the complete GB1 landscape
Now that we have inferred the variance components, we obtain inferences for the complete combinatorial landscape and check the prediction accuracy for the held-out test data. Moreover, we can also see the distribution of phenotypic values within and outside the data
[31]:
pred = model.predict().join(data)
pred
[31]:
| f | y | y_var | |
|---|---|---|---|
| AAAA | 0.632888 | 0.460831 | 0.046009 |
| AAAC | -2.361377 | NaN | NaN |
| AAAD | -3.231004 | NaN | NaN |
| AAAE | -4.600692 | NaN | NaN |
| AAAF | -3.443787 | NaN | NaN |
| ... | ... | ... | ... |
| YYYS | -4.954946 | -5.269987 | 0.291090 |
| YYYT | -3.854892 | -3.821426 | 0.074489 |
| YYYV | -3.438228 | -3.143536 | 0.074682 |
| YYYW | -4.949506 | -4.306581 | 0.699467 |
| YYYY | -5.092643 | -4.429813 | 0.417405 |
160000 rows × 3 columns
[32]:
test = pred.loc[X_test, :]
r2 = pearsonr(test['f'], test['y'])[0] ** 2
lim = (-9, 3)
bins = np.linspace(lim[0] + 0.5, lim[1] - 0.5, 100)
fig, subplots = plt.subplots(1, 2, figsize=(8, 4))
axes = subplots[0]
axes.hist2d(x=test['f'], y=test['y'], cmap='binary', bins=bins)
axes.axline((0, 0), (1, 1), lw=0.5, linestyle='--', c='grey')
axes.set(xlabel=r'$y_{pred}$', ylabel=r'$y_{obs}$',
xlim=(lim[0] + 0.5, lim[1] - 0.5),
ylim=(lim[0] + 0.5, lim[1] - 0.5))
axes.text(0.05, 0.95, r'$R^2$=' + '{:.3f}'.format(r2),
transform=axes.transAxes, color='black', ha='left', va='top')
axes = subplots[1]
axes.hist(pred.dropna()['f'], label='Observed', alpha=0.3, density=True, bins=bins, color='black')
axes.hist(pred.loc[np.isnan(pred['y']), 'f'], label='New', alpha=0.3, density=True, bins=bins, color='grey')
axes.set(xlabel=r'$y_{pred}$', ylabel='% of sequences')
axes.legend(loc=0)
fig.tight_layout()
Estimating local epistatic coefficients under the prior
As previously shown, we are not only restricted to making calibrated predictions for previously uncharacterized or held out sequences, but we can also get calibrated estimates for linear combinations of them and use them to answer questions about specific mutational effects and how they change as we introduced more mutations.
One of the main interactions in the GB1 dataset is that taking place between positions 41 and 54. In the wild-type sequence VDGV, G41L is highly deleterious, but becomes advantageous in the presence of V54G. Now, we can use our model to compute the posterior distribution for this epistatic coefficient given the whole dataset.
We can select the 4 relevant sequences and show their measurement values
[33]:
seqs = ['VDGV', 'VDGG', 'VDLV', 'VDLG']
df = data.loc[seqs, :]
df
[33]:
| y | y_var | |
|---|---|---|
| VDGV | 0.000000 | 0.000027 |
| VDGG | 0.334153 | 0.003038 |
| VDLV | -3.512487 | 0.105615 |
| VDLG | 1.140357 | 0.013197 |
We then define the contrast matrices for the coefficients of interest: the single point mutational effects as well as the epistatic coefficient between them.
[34]:
contrast_matrix = pd.DataFrame({'G41L' : [-1, 0, 1, 0],
'V54G' : [-1, 1, 0, 0],
'G41L:V54G': [ 1, -1, -1, 1]},
index=seqs)
contrast = model.make_contrasts(contrast_matrix)
contrast
100%|██████████| 3/3 [01:08<00:00, 23.00s/it]
[34]:
| estimate | std | ci_95_lower | ci_95_upper | p(|x|>0) | |
|---|---|---|---|---|---|
| G41L | -3.497472 | 0.276578 | -4.050627 | -2.944317 | 1.000000 |
| V54G | -1.094177 | 0.515773 | -2.125724 | -0.062630 | 0.983057 |
| G41L:V54G | 5.739408 | 0.593206 | 4.552996 | 6.925820 | 1.000000 |
We can see that there is strong support in the posterior for the deleterious effect of G41L in the wild-type background, but also that this is strongly reversed in the presence of V54G. Lets now compare these estimates with the naive comparisons using the estimated measurement errors alone
[35]:
make_contrasts(df, contrast_matrix)
[35]:
| estimate | std | ci_95_lower | ci_95_upper | p(|x|>0) | |
|---|---|---|---|---|---|
| G41L | -3.512487 | 0.325026 | -4.162539 | -2.862435 | 1.0 |
| V54G | 0.334153 | 0.055370 | 0.223413 | 0.444892 | 1.0 |
| G41L:V54G | 4.318691 | 0.349109 | 3.620472 | 5.016910 | 1.0 |
We can see how estimates are then shrunk towards 0 and uncertaintly is slightly reduced by using the prior that was estimated across the complete dataset