Summarizing complex landscapes

In the previous section, we illustrate of how to infer a complete combinatorial landscape from empirical data using either VCregression or SeqDEFT. In this section, we illustrate how to obtain low-dimensional summary statistics of the inferred landscapes that characterize the complexity and patterns of genetic interactions across the different sequence positions. We do so by computing the variance explained by genetic interactions of different orders, involving each possible site and pairs of sites.

GB1 protein landscape

In this section, we first show how we can use gpmap-tools to understand the complexity and patterns of genetic interactions across multiple sites in a previously inferred genotype-phenotype map (GB1) using VCregression and previously published high throughput experimental data.

[46]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from gpmap.datasets import DataSet
from gpmap.summary import GPmapSummarizer

We use the inferred complete combinatorial landscape as provided in DataSets

[8]:
gb1 = DataSet('gb1')
gb1.landscape
[8]:
y
seq
AAAA 0.296301
AAAC -2.713474
AAAD -2.912992
AAAE -4.548719
AAAF -3.276738
... ...
YYYS -4.662925
YYYT -3.223102
YYYV -3.001718
YYYW -4.723318
YYYY -4.876429

160000 rows × 1 columns

Firts, we define a GPmapSummarizer object with the configuration of sequence space and the associated functional values for every possible sequence f.

Note: Function values must be provided for sequences in alphabetical order.

[9]:
gb1s = GPmapSummarizer(n_alleles=20, seq_length=4, f=gb1.landscape['y'].values)

How to compute the variance components of interactions of order k

Now, we only need to call the method calc_V_k_variance_components to obtain a pd.DataFrame with the total and percentage variance explained by interactions of every possible order

[10]:
v_k_vcs = gb1s.calc_V_k_variance_components()
v_k_vcs
[10]:
k variance variance_perc variance_perc_cum
0 1 180067.068327 52.811580 52.811580
1 2 128538.741722 37.698921 90.510501
2 3 31449.759748 9.223849 99.734350
3 4 905.762731 0.265650 100.000000

which can be easily represented graphically as follows

[18]:
fig, axes = plt.subplots(1, 1, figsize=(3, 3))

# Plot percentage variance explained
axes.bar(x=v_k_vcs["k"], height=v_k_vcs["variance_perc"], color="black")
axes.set(
    xlabel="Interaction order $k$",
    ylabel="% variance explained",
    xticks=np.arange(1, 10),
    ylim=(0, 100),
)

# Plot cumulative percentage variance explained
axes = axes.twinx()
axes.scatter(v_k_vcs["k"], v_k_vcs["variance_perc_cum"], color="grey", s=25)
axes.plot(v_k_vcs["k"], v_k_vcs["variance_perc_cum"], color="grey", lw=2)
axes.tick_params(axis="y", colors="grey")
axes.spines["right"].set_color("grey")
axes.set(ylim=(0, 105))
axes.set_ylabel("% cumulative variance", color="grey")
fig.tight_layout()
../_images/usage_summary_statistics_9_0.png

However, this only provides information about the complexity of the landscape, but not about which sites are involved in more or less complex genetic interactions and whether interactions are widespread or happening between specific sets of sites.

How to compute the variance components of interactions involving a subsets of sites \(U\)

This can be done by calling the method calc_V_U_variance_components to obtain a pd.DataFrame with the total and percentage variance explained involving every possible combination of sites

[19]:
v_U_vcs = gb1s.calc_V_U_variance_components()
v_U_vcs

[19]:
U k variance variance_perc variance_perc_cum
0 {0} 1 28303.313131 8.301033 8.301033
1 {1} 1 16149.051370 4.736329 13.037362
2 {2} 1 77458.399775 22.717649 35.755012
3 {3} 1 58156.304052 17.056569 52.811580
4 {0, 1} 2 5138.287598 1.507000 54.318581
5 {0, 2} 2 25776.797274 7.560035 61.878616
6 {0, 3} 2 17682.891637 5.186187 67.064803
7 {1, 2} 2 8761.142103 2.569541 69.634344
8 {1, 3} 2 5422.459340 1.590344 71.224688
9 {2, 3} 2 65757.163770 19.285813 90.510501
10 {0, 1, 2} 3 3723.022582 1.091919 91.602420
11 {0, 1, 3} 3 2675.360421 0.784652 92.387072
12 {0, 2, 3} 3 18786.824440 5.509957 97.897030
13 {1, 2, 3} 3 6264.552305 1.837320 99.734350
14 {0, 1, 2, 3} 4 905.762731 0.265650 100.000000

Whereas the number of possible combinations is relatively small in this case, it scales exponentially with sequence length, making it in practical in other cases to interpret the values in this DataFrame directly. Still, we can compute some useful aggregated statistics that we can represent easily as heatmaps.

How to compute the variance explained by interactions of order k involving each site

calc_sites_variance_perc enables aggregation of these variances by whether they involve each possible site and the total number of sites involved in those genetic interactions.

[21]:
sites = gb1s.calc_sites_variance_perc(v_U_vcs).T
sites
[21]:
1 2 3 4
0 8.301033 14.253222 7.386529 0.26565
1 4.736329 5.666886 3.713892 0.26565
2 22.717649 29.415389 8.439197 0.26565
3 17.056569 26.062344 8.131930 0.26565
[44]:
fig, axes = plt.subplots(1, 1, figsize=(3, 3))
im = axes.imshow(
    sites.T.iloc[::-1, :],
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=[0, 1, 2, 3],
    xticklabels=[39, 40, 41, 54],
    xlabel="Site",
    yticks=[0, 1, 2, 3],
    yticklabels=[1, 2, 3, 4][::-1],
    ylabel="Interaction order $k$",
    aspect='equal'
)
plt.colorbar(im, shrink=0.6, label='% variance explained')
fig.tight_layout()
../_images/usage_summary_statistics_14_0.png

This clearly shows that some sites (41 and 54) have a much larger contribution to the phenotypic variance than others (40), and that they contribute mostly through their additive, pairwise and even threeway interactions.

How to compute the variance explained by interactions between every pair of sites

Another interesting way to aggregate the variance explain by every subset of sites \(U\) is by computing the variance explain by interactions of any order involving a pair of sites.

[45]:
pairs = gb1s.calc_site_pairs_variance_perc(v_U_vcs)
pairs

[45]:
site1 site2 variance variance_perc
0 0 1 12442.433333 7.733298
1 0 2 49192.407027 30.574370
2 0 3 40050.839230 24.892646
3 1 2 19654.479720 12.215774
4 1 3 15268.134797 9.489546
5 2 3 91714.303247 57.002842

We can arrange these values on a matrix and represent them in a heatmap

[54]:
m = pd.pivot_table(
    pairs, index="site1", columns="site2", values="variance_perc"
)
m
[54]:
site2 1 2 3
site1
0 7.733298 30.574370 24.892646
1 NaN 12.215774 9.489546
2 NaN NaN 57.002842
[57]:
fig, axes = plt.subplots(1, 1, figsize=(3, 3))
im = axes.imshow(
    m,
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=[0, 1, 2],
    xlabel="Site 1",
    yticks=[0, 1, 2],
    xticklabels=[40, 41, 54],
    yticklabels=[39, 40, 41],
    ylabel="Site 2",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()
../_images/usage_summary_statistics_19_0.png

This plot clearly shows that sites 41 and 54 interact strongly with each other, but also with position 39.

Shine-Dalgarno fitness landscape

In the previous example, the total number of subsets of sites \(U\) is still relatively small given the short sequence length. To better illustrate the usefulness of this approach, we will use a 9-nucleotide long complete combinatorial landscape inferred with VCregression from previously published high throughput experimental data in the dmsC gene context. The inferred landscape is also available as a DataSet for illustration purposes.

[58]:
sd = DataSet('dmsc')
sd.landscape
[58]:
f
AAAAAAAAA 0.563857
AAAAAAAAC 0.701511
AAAAAAAAG 0.619188
AAAAAAAAU 0.628772
AAAAAAACA 0.594683
... ...
UUUUUUUGU 0.615910
UUUUUUUUA 0.539058
UUUUUUUUC 0.548782
UUUUUUUUG 0.536610
UUUUUUUUU 0.564960

262144 rows × 1 columns

Lets start again by defining the GPmapSummarizer object and compute the variance explained by every possible subset of sites

[59]:
sds = GPmapSummarizer(n_alleles=4, seq_length=9, f=sd.landscape["f"].values)
[60]:
v_U_vcs = sds.calc_V_U_variance_components()
v_U_vcs
[60]:
U k variance variance_perc variance_perc_cum
0 {0} 1 1574.656812 5.500762 5.500762
1 {1} 1 2687.622462 9.388695 14.889458
2 {2} 1 2312.377111 8.077847 22.967305
3 {3} 1 1806.662119 6.311229 29.278533
4 {4} 1 1051.784760 3.674209 32.952742
... ... ... ... ... ...
506 {0, 1, 2, 4, 5, 6, 7, 8} 8 2.485503 0.008683 99.956748
507 {0, 1, 3, 4, 5, 6, 7, 8} 8 2.545558 0.008892 99.965641
508 {0, 2, 3, 4, 5, 6, 7, 8} 8 2.358090 0.008238 99.973878
509 {1, 2, 3, 4, 5, 6, 7, 8} 8 2.796804 0.009770 99.983648
510 {0, 1, 2, 3, 4, 5, 6, 7, 8} 9 4.680847 0.016352 100.000000

511 rows × 5 columns

We can see that the number of components now is much larger than before, so we really need to aggregate these components in a meaningful way to obtain low-dimensional summary statistics reflecting the patterns of genetic interactions in the inferred genotype-phenotype map.

How to compute the variance explained by interactions of order k involving each site

[61]:
sites = sds.calc_sites_variance_perc(v_U_vcs)
sites
[61]:
0 1 2 3 4 5 6 7 8
1 5.500762 9.388695 8.077847 6.311229 3.674209 2.277808 1.837005 0.835968 0.399951
2 6.390370 10.088863 10.262287 8.163453 6.898441 4.512101 3.034212 1.908863 1.153861
3 5.799007 8.710946 10.348244 11.095941 10.542178 8.887247 5.887798 3.522035 2.133580
4 2.999159 4.047676 4.776184 5.714224 5.930684 5.423643 4.125461 2.856694 1.932568
5 1.208228 1.508030 1.640607 1.763638 1.809694 1.735487 1.502939 1.196867 0.925827
6 0.441740 0.488117 0.516701 0.522051 0.525214 0.525015 0.496194 0.446754 0.389411
7 0.187551 0.194456 0.197175 0.198214 0.199869 0.199935 0.199599 0.192142 0.179375
8 0.070391 0.071923 0.071268 0.071478 0.071555 0.071555 0.071691 0.071658 0.069766
9 0.016352 0.016352 0.016352 0.016352 0.016352 0.016352 0.016352 0.016352 0.016352
[65]:
orders = np.arange(1, 10)
positions = np.arange(-13, -4)
ticks = np.arange(9)

fig, axes = plt.subplots(1, 1, figsize=(4, 4))
im = axes.imshow(
    sites.iloc[::-1, :],
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=ticks,
    xticklabels=positions,
    xlabel="Site",
    yticks=ticks,
    yticklabels=orders[::-1],
    ylabel="Interaction order $k$",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()

../_images/usage_summary_statistics_27_0.png

This plot shows that whereas some positions have large additive and pairwise contributions, other sites have more context-dependent contributions to phenotypic variation, with the largest contributions arising from interactions of order 3 or higher.

How to compute the variance explained by pairwise and higher order interactions between every pair of sites

In this example, we do an finer aggregation and separate the variance explained by pairwise and higher order interactions between pairs of sites on the upper and lower triangular parts of the heatmap representation.

[88]:
pairs_pw = sds.calc_site_pairs_variance_perc(v_U_vcs.loc[v_U_vcs['k'] == 2, :])
pairs_ho = sds.calc_site_pairs_variance_perc(v_U_vcs, min_k=3)

m_bottom = pairs_pw.pivot(index="site1", columns="site2", values="variance_perc")
m_bottom = m_bottom.reindex(ticks).T.reindex(ticks).fillna(0)
m_top = pairs_ho.pivot(index='site1', columns='site2', values='variance_perc')
m_top = m_top.reindex(ticks).T.reindex(ticks).fillna(0).T
m = m_top + m_bottom
m
[88]:
site2 0 1 2 3 4 5 6 7 8
site1
0 0.000000 20.235400 19.297067 12.736994 8.528527 7.706634 5.887429 4.593244 3.814681
1 8.562998 0.000000 28.937652 22.225871 13.779223 9.638418 7.624355 5.565630 4.245642
2 9.734242 13.710318 0.000000 28.779777 20.837386 13.780875 7.749691 5.786126 4.398405
3 1.380292 9.409153 6.562947 0.000000 31.867450 23.175051 12.769771 6.189255 5.449330
4 1.371878 2.865711 4.951972 7.331872 0.000000 31.061032 19.615363 10.645316 6.161446
5 1.202204 1.320697 2.437803 3.856483 4.213326 0.000000 21.838515 13.053678 7.791366
6 1.367760 1.333584 0.959993 1.554637 3.281415 1.401816 0.000000 13.716746 7.937204
7 0.635882 0.887281 0.527592 0.344046 1.860773 1.435976 1.094349 0.000000 9.259406
8 0.129675 0.408224 0.274862 0.711385 0.446727 1.349362 0.584660 0.498108 0.000000
[89]:
fig, axes = plt.subplots(1, 1, figsize=(4, 4))
im = axes.imshow(
    m,
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=ticks,
    xlabel="Site 1",
    yticks=ticks,
    xticklabels=positions,
    yticklabels=positions,
    ylabel="Site 2",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()

../_images/usage_summary_statistics_30_0.png

Overall, this summary plots show that positions in the Shine-Dalgarno sequence tend to interact mostly with adjacent positions. Moreover, higher order interactions can involve sites that are further appart from each other in the primary sequence.