Datasets

Datasets consist of complete combinatorial landscapes to visualize and work with, together with the data from which they were inferred. Both the inference of the complete landscape and the calculation of the visualization coordinates are precomputed to provide rapid access to the different layers of interest.

[1]:
# Import required libraries
import numpy as np

from gpmap.datasets import DataSet, list_available_datasets
from gpmap.inference import VCregression

How to load a built-in dataset

We include a series of built-in datasets that are used throughout the documentation to demonstrate the different applications. They are directly accessible to any user after installing the library. The list of built-in datasets can be shown as follows:

[2]:
list_available_datasets()
[2]:
['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']

How to access combinatorial landscape values

Any of these datasets can be loaded as illustrated in previous tutorials. Each of them should contain at least a landscape attribute holding the phenotype associated with each possible genotype:

[3]:
gb1 = DataSet('gb1')
gb1.landscape
[3]:
y
seq
AAAA 0.296301
AAAC -2.713474
AAAD -2.912992
AAAE -4.548719
AAAF -3.276738
... ...
YYYS -4.662925
YYYT -3.223102
YYYV -3.001718
YYYW -4.723318
YYYY -4.876429

160000 rows × 1 columns
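The row count follows from the size of sequence space: four positions over a 20-letter amino acid alphabet give 20^4 = 160000 genotypes. The sketch below enumerates them with the standard library only. The alphabet string is an assumption (the 20 standard amino acids in alphabetical one-letter order), chosen because it reproduces the ordering seen in the index above.

```python
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order,
# which matches the ordering shown in the landscape index above.
alphabet = "ACDEFGHIKLMNPQRSTVWY"
seq_length = 4

# Enumerate every sequence of the given length over the alphabet.
# product() varies the last position fastest, giving lexicographic order.
genotypes = ["".join(p) for p in product(alphabet, repeat=seq_length)]

print(len(genotypes))   # 160000
print(genotypes[:5])    # ['AAAA', 'AAAC', 'AAAD', 'AAAE', 'AAAF']
print(genotypes[-1])    # YYYY
```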

How to access the processed data in experimental datasets

If the landscape was obtained from experimental data, then the dataset also has a data attribute that includes the measurement y and, if available, its uncertainty y_var. The data do not necessarily include measurements for every possible sequence. In this case, about 10,000 of the 160,000 possible sequences were not experimentally measured:

[4]:
gb1.data
[4]:
y y_var
sequence
AAAA 0.460831 0.046009
AAAG -2.192261 0.255906
AAAH -4.728306 2.064530
AAAI -4.338842 2.095252
AAAL -2.326240 0.087518
... ... ...
YYYS -5.269987 0.291090
YYYT -3.821426 0.074489
YYYV -3.143536 0.074682
YYYW -4.306581 0.699467
YYYY -4.429813 0.417405

149361 rows × 2 columns
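One way to find which sequences lack a measurement is to compare the two indexes with pandas. The sketch below uses small toy DataFrames as hypothetical stand-ins for `gb1.landscape` and `gb1.data` (their values and two-letter sequences are made up for illustration); the same `Index.difference` call should apply to the real attributes, assuming both are indexed by sequence as shown above.

```python
import pandas as pd

# Toy stand-ins for `gb1.landscape` and `gb1.data` (hypothetical values).
landscape = pd.DataFrame(
    {"y": [0.3, -2.7, -2.9, -4.5]},
    index=["AA", "AC", "AD", "AE"],
)
data = pd.DataFrame(
    {"y": [0.46, -2.9], "y_var": [0.05, 0.26]},
    index=["AA", "AD"],
)

# Sequences present in the complete landscape but absent from the measurements.
missing = landscape.index.difference(data.index)
print(list(missing))      # ['AC', 'AE']

# Applied to the row counts reported above for gb1:
print(160000 - 149361)    # 10639
```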

How to access a dataset visualization

For built-in datasets, we also provide the precomputed coordinates of the visualization, the DataFrame connecting sequences separated by single point mutations, and the relaxation times associated with each of the diffusion axes, in the attributes nodes, edges and relaxation_times, respectively:

[5]:
gb1.nodes
[5]:
1 2 3 4 5 6 7 8 9 10 function stationary_freq
AAAA -0.270938 -0.944304 -0.227171 0.744803 0.059077 -0.077512 -0.477853 0.174491 0.015944 0.052664 0.296301 1.067767e-04
AAAC 0.033789 -0.232603 -0.271458 0.576487 0.035619 0.087608 0.590118 -0.249005 -0.087750 -0.110291 -2.713474 4.954648e-06
AAAD -0.020398 -0.127749 -0.174455 0.347843 0.142684 0.208679 0.590025 0.160819 0.354397 0.676487 -2.912992 4.042194e-06
AAAE -0.001018 -0.138712 -0.183161 0.340728 0.121067 0.157871 0.436407 0.195630 0.211100 0.364298 -4.548719 7.619345e-07
AAAF 0.149717 -0.156524 -0.239304 0.386243 0.103285 0.107756 0.302406 0.051575 0.171278 0.226772 -3.276738 2.789084e-06
... ... ... ... ... ... ... ... ... ... ... ... ...
YYYS 0.073880 0.038075 -0.097751 0.156184 0.056463 0.074291 0.262512 0.019037 0.144439 0.172686 -4.662925 6.781399e-07
YYYT -0.091125 0.213370 0.256403 0.246274 -0.086279 0.026923 0.217086 0.102111 0.593257 0.003682 -3.223102 2.945947e-06
YYYV 0.016488 0.195242 0.216320 0.035269 0.306726 0.025334 0.217759 -0.028542 -0.038378 0.148356 -3.001718 3.692393e-06
YYYW 0.134072 0.114107 -0.043856 0.011092 0.076565 0.109907 0.261274 0.108909 0.180371 0.365348 -4.723318 6.376209e-07
YYYY 0.086278 0.113188 -0.035062 0.017597 0.096670 0.249598 0.261107 0.108695 0.188331 0.368750 -4.876429 5.454157e-07

160000 rows × 12 columns

[6]:
gb1.edges
[6]:
i j
0 0 1
1 0 2
2 0 3
3 0 4
4 0 5
... ... ...
6079995 159996 159998
6079996 159996 159999
6079997 159997 159998
6079998 159997 159999
6079999 159998 159999

6080000 rows × 2 columns
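The number of rows in the edges table can be checked by hand. Each genotype of length L over an alphabet of size a has L(a - 1) single-mutation neighbors, and each undirected edge is shared by two genotypes, so the total is a^L · L(a - 1) / 2. The sketch below is plain Python, independent of the library; the function names are illustrative, not part of any API.

```python
from itertools import combinations, product


def n_edges(alphabet_size, seq_length):
    """Number of undirected single-mutation edges in the full sequence space."""
    n_genotypes = alphabet_size ** seq_length
    n_neighbors = seq_length * (alphabet_size - 1)
    return n_genotypes * n_neighbors // 2


def count_edges_brute_force(alphabet, seq_length):
    """Count pairs of sequences at Hamming distance 1 (small spaces only)."""
    seqs = list(product(alphabet, repeat=seq_length))
    return sum(
        1
        for a, b in combinations(seqs, 2)
        if sum(x != y for x, y in zip(a, b)) == 1
    )


print(n_edges(20, 4))                      # 6080000, as in the table above
print(n_edges(3, 2))                       # 18
print(count_edges_brute_force("ACG", 2))   # 18
```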

[7]:
gb1.relaxation_times
[7]:
k decay_rates relaxation_time
0 1 2.554843 0.391413
1 2 3.566862 0.280359
2 3 4.926568 0.202981
3 4 5.023657 0.199058
4 5 5.303026 0.188572
5 6 5.635594 0.177444
6 7 6.294868 0.158860
7 8 6.543588 0.152821
8 9 6.741685 0.148331
9 10 7.000798 0.142841
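The tabulated values are numerically consistent with each relaxation time being the reciprocal of its decay rate. This relationship is inferred from the output above rather than stated in the text, so treat it as an assumption. A minimal check with NumPy, using the first three rows copied from the table:

```python
import numpy as np

# First three rows of the table above.
decay_rates = np.array([2.554843, 3.566862, 4.926568])
tabulated_times = np.array([0.391413, 0.280359, 0.202981])

# Assumed relationship: relaxation_time = 1 / decay_rate.
relaxation_times = 1.0 / decay_rates

# Agreement up to the rounding of the printed values.
print(np.allclose(relaxation_times, tabulated_times, rtol=0, atol=1e-5))  # True
```

Under this reading, axes with smaller decay rates relax more slowly, which is why the first diffusion axes have the longest relaxation times.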

How to build new datasets

We also provide functionality to create new datasets and store them in the local copy of your library for easier access. Let's build a new dataset from simulated data:

[8]:
np.random.seed(0)
lambdas = np.array([10, 2, 0.5, 0.1, 0.02, 0])
model = VCregression(seq_length=5, alphabet_type='dna', lambdas=lambdas)
data = model.simulate(p_missing=0.2, sigma=0.1).drop('y_true', axis=1).dropna()
data
[8]:
y y_var
AAAAA 0.039540 0.01
AAAAG -0.117862 0.01
AAAAT 0.303257 0.01
AAACA 0.230550 0.01
AAACC -0.000383 0.01
... ... ...
TTTCT -0.010906 0.01
TTTGG -0.398118 0.01
TTTTC -0.320779 0.01
TTTTG -0.266456 0.01
TTTTT -0.141016 0.01

814 rows × 2 columns
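The numbers in this output can be sanity-checked with simple arithmetic. The sketch below assumes that `p_missing` is the per-sequence probability of being dropped and that `sigma` is the standard deviation of the measurement noise, so that `y_var = sigma ** 2`. Both readings are inferred from the parameter names and the output, not confirmed by the text.

```python
n_genotypes = 4 ** 5      # DNA alphabet, sequence length 5
p_missing = 0.2           # assumed: probability that a sequence is dropped
sigma = 0.1               # assumed: standard deviation of measurement noise

expected_rows = n_genotypes * (1 - p_missing)
y_var = sigma ** 2

print(n_genotypes)               # 1024
print(round(expected_rows, 1))   # 819.2, close to the 814 rows observed
print(round(y_var, 4))           # 0.01, matching the y_var column
```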

The build method uses default values to run Variance Component regression and compute the visualization coordinates automatically, but these defaults may not be the best choice for every dataset.

[9]:
test = DataSet('test', data=data)
test.build()
100%|██████████| 100/100 [00:02<00:00, 41.79it/s]

We can now reload the dataset from disk and verify that it contains the visualization attributes.

Note that reinstalling the library will erase any newly created DataSets.

[10]:
test = DataSet('test')
test.nodes
[10]:
1 2 3 4 5 6 7 8 9 10 ... 13 14 15 16 17 18 19 20 function stationary_freq
AAAAA 2.922532 1.612996 1.922566 1.928836 0.651238 0.724700 -0.626236 0.457580 -0.160287 0.765350 ... 2.476916 3.135844 0.294568 1.121940 -0.204409 -1.407941 0.419596 -1.271204 -0.019814 0.000053
AAAAC 2.681952 1.434138 2.081914 0.677991 0.593148 1.109453 -0.824506 1.029326 -0.209692 0.677500 ... 1.768603 2.792991 0.789508 1.052783 0.499965 -1.928383 0.194958 -0.919782 -0.038036 0.000044
AAAAG 2.604061 1.908399 2.159282 0.320091 0.376623 0.861715 -0.115110 0.965854 0.537670 0.772480 ... 2.137501 2.527706 0.950794 1.133436 0.317646 -1.593851 0.628486 -1.566774 -0.120535 0.000020
AAAAT 1.960418 2.015446 2.802114 -0.144481 1.141420 2.506787 -1.358454 1.476034 -0.333192 0.398977 ... 1.890287 2.614852 2.739016 -0.219729 -0.199154 -1.568183 0.575084 -1.885819 0.220953 0.000543
AAACA 3.814800 1.261699 0.811080 4.448437 0.032213 -0.514836 -0.255077 -0.670186 -0.133688 0.566255 ... 1.577136 1.222345 -0.527213 0.709598 -0.291410 -1.466313 -0.167420 -1.182996 0.201557 0.000450
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTGT 2.140434 -0.394854 0.736442 -0.155903 -0.180521 1.415161 0.583885 0.179762 -0.547157 0.584356 ... 0.441516 0.353044 -0.189312 -0.864722 0.205254 0.092788 -0.209753 0.350808 -0.114952 0.000021
TTTTA 2.902003 0.551990 -0.275240 1.688366 -0.765170 0.558554 0.652595 0.057239 -0.154063 -0.473734 ... 0.337019 -0.614797 -0.057125 -0.069688 1.167145 0.828056 -0.541204 1.131136 -0.213529 0.000008
TTTTC 2.601067 0.128646 -0.267689 0.344941 -0.793282 0.625832 0.177349 0.472832 -0.466700 0.047076 ... -0.532616 -0.050942 -0.458822 0.365146 0.399918 0.206811 -0.098076 0.757820 -0.344745 0.000002
TTTTG 2.157413 0.586338 -0.397141 0.210420 -0.676836 0.744688 0.895972 0.421951 0.458642 -0.347510 ... -0.238223 0.093321 0.244368 0.207288 0.372603 0.086408 0.194293 0.455535 -0.283153 0.000004
TTTTT 0.244623 0.310516 0.500102 0.003702 -0.126980 1.207871 0.062054 0.033888 -0.166582 -0.031377 ... 0.030637 0.149465 -0.067944 0.222734 0.223881 0.163491 -0.061888 0.205998 -0.090435 0.000027

1024 rows × 22 columns