Summarizing complex landscapes

In the previous section, we illustrate of how to infer a complete combinatorial landscape from empirical data using either VCregression or SeqDEFT. In this section, we illustrate how to obtain low-dimensional summary statistics of the inferred landscapes that characterize the complexity and patterns of genetic interactions across the different sequence positions. We do so by computing the variance explained by genetic interactions of different orders, involving each possible site and pairs of sites.

GB1 protein landscape

In this section, we first show how we can use gpmap-tools to understand the complexity and patterns of genetic interactions across multiple sites in a previously inferred genotype-phenotype map (GB1) using VCregression and previously published high throughput experimental data.

[46]:

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from gpmap.datasets import DataSet
from gpmap.summary import GPmapSummarizer

We use the inferred complete combinatorial landscape as provided in DataSets

[8]:

gb1 = DataSet('gb1')
gb1.landscape

[8]:

	y
seq
AAAA	0.296301
AAAC	-2.713474
AAAD	-2.912992
AAAE	-4.548719
AAAF	-3.276738
...	...
YYYS	-4.662925
YYYT	-3.223102
YYYV	-3.001718
YYYW	-4.723318
YYYY	-4.876429

160000 rows × 1 columns

Firts, we define a GPmapSummarizer object with the configuration of sequence space and the associated functional values for every possible sequence f.

Note: Function values must be provided for sequences in alphabetical order.

[9]:

gb1s = GPmapSummarizer(n_alleles=20, seq_length=4, f=gb1.landscape['y'].values)

How to compute the variance components of interactions of order `k`

Now, we only need to call the method calc_V_k_variance_components to obtain a pd.DataFrame with the total and percentage variance explained by interactions of every possible order

[10]:

v_k_vcs = gb1s.calc_V_k_variance_components()
v_k_vcs

[10]:

	k	variance	variance_perc	variance_perc_cum
0	1	180067.068327	52.811580	52.811580
1	2	128538.741722	37.698921	90.510501
2	3	31449.759748	9.223849	99.734350
3	4	905.762731	0.265650	100.000000

which can be easily represented graphically as follows

[18]:

fig, axes = plt.subplots(1, 1, figsize=(3, 3))

# Plot percentage variance explained
axes.bar(x=v_k_vcs["k"], height=v_k_vcs["variance_perc"], color="black")
axes.set(
    xlabel="Interaction order $k$",
    ylabel="% variance explained",
    xticks=np.arange(1, 10),
    ylim=(0, 100),
)

# Plot cumulative percentage variance explained
axes = axes.twinx()
axes.scatter(v_k_vcs["k"], v_k_vcs["variance_perc_cum"], color="grey", s=25)
axes.plot(v_k_vcs["k"], v_k_vcs["variance_perc_cum"], color="grey", lw=2)
axes.tick_params(axis="y", colors="grey")
axes.spines["right"].set_color("grey")
axes.set(ylim=(0, 105))
axes.set_ylabel("% cumulative variance", color="grey")
fig.tight_layout()

../_images/usage_summary_statistics_9_0.png

However, this only provides information about the complexity of the landscape, but not about which sites are involved in more or less complex genetic interactions and whether interactions are widespread or happening between specific sets of sites.

How to compute the variance components of interactions involving a subsets of sites \(U\)

This can be done by calling the method calc_V_U_variance_components to obtain a pd.DataFrame with the total and percentage variance explained involving every possible combination of sites

[19]:

v_U_vcs = gb1s.calc_V_U_variance_components()
v_U_vcs

[19]:

	U	k	variance	variance_perc	variance_perc_cum
0	{0}	1	28303.313131	8.301033	8.301033
1	{1}	1	16149.051370	4.736329	13.037362
2	{2}	1	77458.399775	22.717649	35.755012
3	{3}	1	58156.304052	17.056569	52.811580
4	{0, 1}	2	5138.287598	1.507000	54.318581
5	{0, 2}	2	25776.797274	7.560035	61.878616
6	{0, 3}	2	17682.891637	5.186187	67.064803
7	{1, 2}	2	8761.142103	2.569541	69.634344
8	{1, 3}	2	5422.459340	1.590344	71.224688
9	{2, 3}	2	65757.163770	19.285813	90.510501
10	{0, 1, 2}	3	3723.022582	1.091919	91.602420
11	{0, 1, 3}	3	2675.360421	0.784652	92.387072
12	{0, 2, 3}	3	18786.824440	5.509957	97.897030
13	{1, 2, 3}	3	6264.552305	1.837320	99.734350
14	{0, 1, 2, 3}	4	905.762731	0.265650	100.000000

Whereas the number of possible combinations is relatively small in this case, it scales exponentially with sequence length, making it in practical in other cases to interpret the values in this DataFrame directly. Still, we can compute some useful aggregated statistics that we can represent easily as heatmaps.

How to compute the variance explained by interactions of order `k` involving each site

calc_sites_variance_perc enables aggregation of these variances by whether they involve each possible site and the total number of sites involved in those genetic interactions.

[21]:

sites = gb1s.calc_sites_variance_perc(v_U_vcs).T
sites

[21]:

	1	2	3	4
0	8.301033	14.253222	7.386529	0.26565
1	4.736329	5.666886	3.713892	0.26565
2	22.717649	29.415389	8.439197	0.26565
3	17.056569	26.062344	8.131930	0.26565

[44]:

fig, axes = plt.subplots(1, 1, figsize=(3, 3))
im = axes.imshow(
    sites.T.iloc[::-1, :],
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=[0, 1, 2, 3],
    xticklabels=[39, 40, 41, 54],
    xlabel="Site",
    yticks=[0, 1, 2, 3],
    yticklabels=[1, 2, 3, 4][::-1],
    ylabel="Interaction order $k$",
    aspect='equal'
)
plt.colorbar(im, shrink=0.6, label='% variance explained')
fig.tight_layout()

../_images/usage_summary_statistics_14_0.png

This clearly shows that some sites (41 and 54) have a much larger contribution to the phenotypic variance than others (40), and that they contribute mostly through their additive, pairwise and even threeway interactions.

How to compute the variance explained by interactions between every pair of sites

Another interesting way to aggregate the variance explain by every subset of sites \(U\) is by computing the variance explain by interactions of any order involving a pair of sites.

[45]:

pairs = gb1s.calc_site_pairs_variance_perc(v_U_vcs)
pairs

[45]:

	site1	site2	variance	variance_perc
0	0	1	12442.433333	7.733298
1	0	2	49192.407027	30.574370
2	0	3	40050.839230	24.892646
3	1	2	19654.479720	12.215774
4	1	3	15268.134797	9.489546
5	2	3	91714.303247	57.002842

We can arrange these values on a matrix and represent them in a heatmap

[54]:

m = pd.pivot_table(
    pairs, index="site1", columns="site2", values="variance_perc"
)
m

[54]:

site2	1	2	3
site1
0	7.733298	30.574370	24.892646
1	NaN	12.215774	9.489546
2	NaN	NaN	57.002842

[57]:

fig, axes = plt.subplots(1, 1, figsize=(3, 3))
im = axes.imshow(
    m,
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=[0, 1, 2],
    xlabel="Site 1",
    yticks=[0, 1, 2],
    xticklabels=[40, 41, 54],
    yticklabels=[39, 40, 41],
    ylabel="Site 2",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()

../_images/usage_summary_statistics_19_0.png

This plot clearly shows that sites 41 and 54 interact strongly with each other, but also with position 39.

Shine-Dalgarno fitness landscape

In the previous example, the total number of subsets of sites \(U\) is still relatively small given the short sequence length. To better illustrate the usefulness of this approach, we will use a 9-nucleotide long complete combinatorial landscape inferred with VCregression from previously published high throughput experimental data in the dmsC gene context. The inferred landscape is also available as a DataSet for illustration purposes.

[58]:

sd = DataSet('dmsc')
sd.landscape

[58]:

	f
AAAAAAAAA	0.563857
AAAAAAAAC	0.701511
AAAAAAAAG	0.619188
AAAAAAAAU	0.628772
AAAAAAACA	0.594683
...	...
UUUUUUUGU	0.615910
UUUUUUUUA	0.539058
UUUUUUUUC	0.548782
UUUUUUUUG	0.536610
UUUUUUUUU	0.564960

262144 rows × 1 columns

Lets start again by defining the GPmapSummarizer object and compute the variance explained by every possible subset of sites

[59]:

sds = GPmapSummarizer(n_alleles=4, seq_length=9, f=sd.landscape["f"].values)

[60]:

v_U_vcs = sds.calc_V_U_variance_components()
v_U_vcs

[60]:

	U	k	variance	variance_perc	variance_perc_cum
0	{0}	1	1574.656812	5.500762	5.500762
1	{1}	1	2687.622462	9.388695	14.889458
2	{2}	1	2312.377111	8.077847	22.967305
3	{3}	1	1806.662119	6.311229	29.278533
4	{4}	1	1051.784760	3.674209	32.952742
...	...	...	...	...	...
506	{0, 1, 2, 4, 5, 6, 7, 8}	8	2.485503	0.008683	99.956748
507	{0, 1, 3, 4, 5, 6, 7, 8}	8	2.545558	0.008892	99.965641
508	{0, 2, 3, 4, 5, 6, 7, 8}	8	2.358090	0.008238	99.973878
509	{1, 2, 3, 4, 5, 6, 7, 8}	8	2.796804	0.009770	99.983648
510	{0, 1, 2, 3, 4, 5, 6, 7, 8}	9	4.680847	0.016352	100.000000

511 rows × 5 columns

We can see that the number of components now is much larger than before, so we really need to aggregate these components in a meaningful way to obtain low-dimensional summary statistics reflecting the patterns of genetic interactions in the inferred genotype-phenotype map.

How to compute the variance explained by interactions of order `k` involving each site

[61]:

sites = sds.calc_sites_variance_perc(v_U_vcs)
sites

[61]:

	0	1	2	3	4	5	6	7	8
1	5.500762	9.388695	8.077847	6.311229	3.674209	2.277808	1.837005	0.835968	0.399951
2	6.390370	10.088863	10.262287	8.163453	6.898441	4.512101	3.034212	1.908863	1.153861
3	5.799007	8.710946	10.348244	11.095941	10.542178	8.887247	5.887798	3.522035	2.133580
4	2.999159	4.047676	4.776184	5.714224	5.930684	5.423643	4.125461	2.856694	1.932568
5	1.208228	1.508030	1.640607	1.763638	1.809694	1.735487	1.502939	1.196867	0.925827
6	0.441740	0.488117	0.516701	0.522051	0.525214	0.525015	0.496194	0.446754	0.389411
7	0.187551	0.194456	0.197175	0.198214	0.199869	0.199935	0.199599	0.192142	0.179375
8	0.070391	0.071923	0.071268	0.071478	0.071555	0.071555	0.071691	0.071658	0.069766
9	0.016352	0.016352	0.016352	0.016352	0.016352	0.016352	0.016352	0.016352	0.016352

[65]:

orders = np.arange(1, 10)
positions = np.arange(-13, -4)
ticks = np.arange(9)

fig, axes = plt.subplots(1, 1, figsize=(4, 4))
im = axes.imshow(
    sites.iloc[::-1, :],
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=ticks,
    xticklabels=positions,
    xlabel="Site",
    yticks=ticks,
    yticklabels=orders[::-1],
    ylabel="Interaction order $k$",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()

../_images/usage_summary_statistics_27_0.png

This plot shows that whereas some positions have large additive and pairwise contributions, other sites have more context-dependent contributions to phenotypic variation, with the largest contributions arising from interactions of order 3 or higher.

How to compute the variance explained by pairwise and higher order interactions between every pair of sites

In this example, we do an finer aggregation and separate the variance explained by pairwise and higher order interactions between pairs of sites on the upper and lower triangular parts of the heatmap representation.

[88]:

pairs_pw = sds.calc_site_pairs_variance_perc(v_U_vcs.loc[v_U_vcs['k'] == 2, :])
pairs_ho = sds.calc_site_pairs_variance_perc(v_U_vcs, min_k=3)

m_bottom = pairs_pw.pivot(index="site1", columns="site2", values="variance_perc")
m_bottom = m_bottom.reindex(ticks).T.reindex(ticks).fillna(0)
m_top = pairs_ho.pivot(index='site1', columns='site2', values='variance_perc')
m_top = m_top.reindex(ticks).T.reindex(ticks).fillna(0).T
m = m_top + m_bottom
m

[88]:

site2	0	1	2	3	4	5	6	7	8
site1
0	0.000000	20.235400	19.297067	12.736994	8.528527	7.706634	5.887429	4.593244	3.814681
1	8.562998	0.000000	28.937652	22.225871	13.779223	9.638418	7.624355	5.565630	4.245642
2	9.734242	13.710318	0.000000	28.779777	20.837386	13.780875	7.749691	5.786126	4.398405
3	1.380292	9.409153	6.562947	0.000000	31.867450	23.175051	12.769771	6.189255	5.449330
4	1.371878	2.865711	4.951972	7.331872	0.000000	31.061032	19.615363	10.645316	6.161446
5	1.202204	1.320697	2.437803	3.856483	4.213326	0.000000	21.838515	13.053678	7.791366
6	1.367760	1.333584	0.959993	1.554637	3.281415	1.401816	0.000000	13.716746	7.937204
7	0.635882	0.887281	0.527592	0.344046	1.860773	1.435976	1.094349	0.000000	9.259406
8	0.129675	0.408224	0.274862	0.711385	0.446727	1.349362	0.584660	0.498108	0.000000

[89]:

fig, axes = plt.subplots(1, 1, figsize=(4, 4))
im = axes.imshow(
    m,
    cmap="Greys",
    vmin=0,
)
axes.set(
    xticks=ticks,
    xlabel="Site 1",
    yticks=ticks,
    xticklabels=positions,
    yticklabels=positions,
    ylabel="Site 2",
    aspect="equal",
)
plt.colorbar(im, shrink=0.6, label="% variance explained")
fig.tight_layout()

../_images/usage_summary_statistics_30_0.png

Overall, this summary plots show that positions in the Shine-Dalgarno sequence tend to interact mostly with adjacent positions. Moreover, higher order interactions can involve sites that are further appart from each other in the primary sequence.

Summarizing complex landscapes

GB1 protein landscape

How to compute the variance components of interactions of order k

How to compute the variance components of interactions involving a subsets of sites \(U\)

How to compute the variance explained by interactions of order k involving each site

How to compute the variance explained by interactions between every pair of sites

Shine-Dalgarno fitness landscape

How to compute the variance explained by interactions of order k involving each site

How to compute the variance explained by pairwise and higher order interactions between every pair of sites

How to compute the variance components of interactions of order `k`

How to compute the variance explained by interactions of order `k` involving each site

How to compute the variance explained by interactions of order `k` involving each site