Datasets
This section describes the built-in datasets used for demonstration throughout the documentation, and shows how to build and store new datasets locally so that they are accessible through the same interface.
Datasets consist of complete combinatorial landscapes that can be visualized and analyzed, as well as the data from which they were derived. Both the inference of the complete landscape and the calculation of visualization coordinates are precomputed to provide quick access to the various layers of interest.
[1]:
# Import required libraries
import numpy as np
import pandas as pd
from gpmap.datasets import DataSet, list_available_datasets
from gpmap.inference import VCregression
How to load a built-in dataset
The library ships with a series of datasets that are used throughout the documentation to demonstrate its different applications. They are available to any user directly after installation and can be listed as follows:
[2]:
list_available_datasets()
[2]:
['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']
How to access combinatorial landscape values
Any of these datasets can be loaded by name, as illustrated in previous tutorials. Every dataset contains at least a landscape attribute holding the phenotype associated with each possible genotype:
[3]:
gb1 = DataSet('gb1')
gb1.landscape
[3]:
| seq | y |
|---|---|
| AAAA | 0.296301 |
| AAAC | -2.713474 |
| AAAD | -2.912992 |
| AAAE | -4.548719 |
| AAAF | -3.276738 |
| ... | ... |
| YYYS | -4.662925 |
| YYYT | -3.223102 |
| YYYV | -3.001718 |
| YYYW | -4.723318 |
| YYYY | -4.876429 |
160000 rows × 1 columns
How to access the processed data in experimental datasets
If the landscape was obtained from experimental data, the dataset also has a data attribute that includes the measurement y and, if available, its uncertainty y_var. The data need not include measurements for every possible sequence: in this case, 10,639 of the 160,000 possible sequences (about 7%) were not experimentally measured.
[4]:
gb1.data
[4]:
| sequence | y | y_var |
|---|---|---|
| AAAA | 0.460831 | 0.046009 |
| AAAG | -2.192261 | 0.255906 |
| AAAH | -4.728306 | 2.064530 |
| AAAI | -4.338842 | 2.095252 |
| AAAL | -2.326240 | 0.087518 |
| ... | ... | ... |
| YYYS | -5.269987 | 0.291090 |
| YYYT | -3.821426 | 0.074489 |
| YYYV | -3.143536 | 0.074682 |
| YYYW | -4.306581 | 0.699467 |
| YYYY | -4.429813 | 0.417405 |
149361 rows × 2 columns
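The number of unmeasured sequences follows from the size of the genotype space. The sketch below uses only the standard library and does not depend on gpmap: it computes the gap for GB1 from the row counts printed above, then lists the missing genotypes of a toy two-letter landscape. The helper `missing_genotypes` and the toy data are illustrative assumptions, not part of the library.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter protein alphabet


def missing_genotypes(measured, alphabet, length):
    """Return every sequence of the given length that is absent from `measured`."""
    measured = set(measured)
    all_seqs = ("".join(p) for p in product(alphabet, repeat=length))
    return [seq for seq in all_seqs if seq not in measured]


# GB1: 4 variable positions over 20 amino acids
n_total = len(AMINO_ACIDS) ** 4   # rows in gb1.landscape
n_measured = 149361               # rows in gb1.data
n_missing = n_total - n_measured
print(n_total, n_missing)         # 160000 10639

# Toy example (hypothetical data): alphabet "AC", sequences of length 2
toy_missing = missing_genotypes(["AA", "AC", "CA"], "AC", 2)
print(toy_missing)                # ['CC']
```

With the library loaded, the same comparison could be made between the indexes of gb1.landscape and gb1.data.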
How to access a dataset visualization
For built-in datasets, we also provide the precomputed visualization coordinates, a DataFrame of edges connecting sequences separated by single point mutations, and the relaxation times associated with each of the diffusion axes. These are stored in the attributes nodes, edges and relaxation_times, respectively.
[5]:
gb1.nodes
[5]:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | function | stationary_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAA | -0.270938 | -0.944304 | -0.227171 | 0.744803 | 0.059077 | -0.077512 | -0.477853 | 0.174491 | 0.015944 | 0.052664 | 0.296301 | 1.067767e-04 |
| AAAC | 0.033789 | -0.232603 | -0.271458 | 0.576487 | 0.035619 | 0.087608 | 0.590118 | -0.249005 | -0.087750 | -0.110291 | -2.713474 | 4.954648e-06 |
| AAAD | -0.020398 | -0.127749 | -0.174455 | 0.347843 | 0.142684 | 0.208679 | 0.590025 | 0.160819 | 0.354397 | 0.676487 | -2.912992 | 4.042194e-06 |
| AAAE | -0.001018 | -0.138712 | -0.183161 | 0.340728 | 0.121067 | 0.157871 | 0.436407 | 0.195630 | 0.211100 | 0.364298 | -4.548719 | 7.619345e-07 |
| AAAF | 0.149717 | -0.156524 | -0.239304 | 0.386243 | 0.103285 | 0.107756 | 0.302406 | 0.051575 | 0.171278 | 0.226772 | -3.276738 | 2.789084e-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| YYYS | 0.073880 | 0.038075 | -0.097751 | 0.156184 | 0.056463 | 0.074291 | 0.262512 | 0.019037 | 0.144439 | 0.172686 | -4.662925 | 6.781399e-07 |
| YYYT | -0.091125 | 0.213370 | 0.256403 | 0.246274 | -0.086279 | 0.026923 | 0.217086 | 0.102111 | 0.593257 | 0.003682 | -3.223102 | 2.945947e-06 |
| YYYV | 0.016488 | 0.195242 | 0.216320 | 0.035269 | 0.306726 | 0.025334 | 0.217759 | -0.028542 | -0.038378 | 0.148356 | -3.001718 | 3.692393e-06 |
| YYYW | 0.134072 | 0.114107 | -0.043856 | 0.011092 | 0.076565 | 0.109907 | 0.261274 | 0.108909 | 0.180371 | 0.365348 | -4.723318 | 6.376209e-07 |
| YYYY | 0.086278 | 0.113188 | -0.035062 | 0.017597 | 0.096670 | 0.249598 | 0.261107 | 0.108695 | 0.188331 | 0.368750 | -4.876429 | 5.454157e-07 |
160000 rows × 12 columns
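In the rows printed above, stationary_freq rises steeply with function. The sketch below is a check on those printed values only, not a statement of how the library computes this column. It estimates the slope of log(stationary_freq) against function between pairs of rows; a near-constant slope is what an exponential dependence on function would predict.

```python
import math

# (function, stationary_freq) pairs copied from the gb1.nodes output above
rows = {
    "AAAA": (0.296301, 1.067767e-04),
    "AAAC": (-2.713474, 4.954648e-06),
    "AAAD": (-2.912992, 4.042194e-06),
    "YYYW": (-4.723318, 6.376209e-07),
    "YYYY": (-4.876429, 5.454157e-07),
}


def log_slope(seq_a, seq_b):
    """Slope of log(stationary_freq) versus function between two rows."""
    f_a, p_a = rows[seq_a]
    f_b, p_b = rows[seq_b]
    return (math.log(p_a) - math.log(p_b)) / (f_a - f_b)


slopes = [
    log_slope("AAAA", "AAAC"),
    log_slope("AAAC", "AAAD"),
    log_slope("YYYW", "YYYY"),
]
print([round(s, 3) for s in slopes])
```

The three slopes agree closely (roughly 1.02 for these rows), so within this output a difference of one unit in function corresponds to a nearly constant fold change in stationary frequency.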
[6]:
gb1.edges
[6]:
| | i | j |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 0 | 5 |
| ... | ... | ... |
| 6079995 | 159996 | 159998 |
| 6079996 | 159996 | 159999 |
| 6079997 | 159997 | 159998 |
| 6079998 | 159997 | 159999 |
| 6079999 | 159998 | 159999 |
6080000 rows × 2 columns
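The row count of edges can be checked by hand. Each of the 160,000 genotypes has 4 × 19 = 76 single-mutation neighbors, and the rows shown suggest that each pair is listed once with i < j, giving 160,000 × 76 / 2 = 6,080,000. The sketch below enumerates such edges for a small DNA landscape using only the standard library. The function `single_mutant_edges` is a hypothetical helper written for illustration, not the library's implementation.

```python
from itertools import product


def single_mutant_edges(alphabet, length):
    """Return all sequences and the sorted (i, j) index pairs, i < j,
    of sequences that differ at exactly one position."""
    seqs = ["".join(p) for p in product(alphabet, repeat=length)]
    index = {seq: i for i, seq in enumerate(seqs)}
    edges = []
    for i, seq in enumerate(seqs):
        for pos in range(length):
            for allele in alphabet:
                if allele == seq[pos]:
                    continue
                j = index[seq[:pos] + allele + seq[pos + 1:]]
                if i < j:  # keep each undirected pair once
                    edges.append((i, j))
    return seqs, sorted(edges)


seqs, edges = single_mutant_edges("ACGT", 3)
expected = len(seqs) * 3 * (4 - 1) // 2
print(len(seqs), len(edges), expected)  # 64 288 288
print(edges[:4])                        # [(0, 1), (0, 2), (0, 3), (0, 4)]

# Same formula applied to GB1 (20 amino acids, 4 positions)
gb1_edges = 20 ** 4 * 4 * (20 - 1) // 2
print(gb1_edges)                        # 6080000
```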
[7]:
gb1.relaxation_times
[7]:
| | k | decay_rates | relaxation_time |
|---|---|---|---|
| 0 | 1 | 2.554843 | 0.391413 |
| 1 | 2 | 3.566862 | 0.280359 |
| 2 | 3 | 4.926568 | 0.202981 |
| 3 | 4 | 5.023657 | 0.199058 |
| 4 | 5 | 5.303026 | 0.188572 |
| 5 | 6 | 5.635594 | 0.177444 |
| 6 | 7 | 6.294868 | 0.158860 |
| 7 | 8 | 6.543588 | 0.152821 |
| 8 | 9 | 6.741685 | 0.148331 |
| 9 | 10 | 7.000798 | 0.142841 |
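In the table above, each relaxation_time matches the reciprocal of the corresponding decay rate, so the axes with the smallest k decay most slowly and have the longest relaxation times. A short check on the first three printed rows, using only values copied from the output:

```python
# Values copied from the first three rows of gb1.relaxation_times
decay_rates = [2.554843, 3.566862, 4.926568]
reported_times = [0.391413, 0.280359, 0.202981]

# The relaxation time is taken here as the reciprocal of the decay rate
computed_times = [1.0 / rate for rate in decay_rates]

for k, (computed, reported) in enumerate(zip(computed_times, reported_times), start=1):
    print(k, round(computed, 6), reported)
```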
How to build new datasets
We also provide functionality to create new datasets and store them in the local copy of the library for easier access. Let's build a new dataset from simulated data:
[8]:
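# Fix the random seed so the simulation is reproducible
np.random.seed(0)
# One variance component per interaction order, 0 through seq_length
lambdas = np.array([10, 2, 0.5, 0.1, 0.02, 0])
model = VCregression(seq_length=5, alphabet_type='dna', lambdas=lambdas)
# Simulate measurements, leaving a fraction p_missing of sequences unobserved
f, X, y, y_var = model.simulate(p_missing=0.2, y_var=0.01)
# Index the measurements and their variances by sequence
data = pd.DataFrame({'y': y, 'y_var': y_var}, index=X)
data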
[8]:
| | y | y_var |
|---|---|---|
| AAAAA | 0.425225 | 0.01 |
| AAAAC | 0.025961 | 0.01 |
| AAAAG | 0.211691 | 0.01 |
| AAAAT | -0.169990 | 0.01 |
| AAACA | 0.141775 | 0.01 |
| ... | ... | ... |
| TTTGA | 0.080597 | 0.01 |
| TTTGC | -0.092545 | 0.01 |
| TTTGG | -0.509507 | 0.01 |
| TTTTA | 0.510968 | 0.01 |
| TTTTG | -0.646299 | 0.01 |
816 rows × 2 columns
The build method runs variance component regression and computes the visualization coordinates automatically using default settings, which may not be the best choice for every dataset.
[9]:
test = DataSet('test', data=data)
test.build()
100%|██████████| 100/100 [00:02<00:00, 42.60it/s]
We can now reload the dataset from disk and verify that it contains the visualization attributes.
Note that reinstalling the library will erase any newly created datasets.
[10]:
test = DataSet('test')
test.nodes
[10]:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | function | stationary_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAA | 1.179083 | -0.268895 | 3.019227 | 0.931148 | -0.423071 | 0.313136 | 0.423140 | -0.123224 | -0.889849 | 0.905356 | ... | 0.049589 | -0.813528 | -0.733640 | 0.179086 | 0.185295 | 0.014557 | 0.233692 | 0.093785 | 0.387054 | 1.734146e-04 |
| AAAAC | 1.072408 | 0.056435 | 2.945652 | 0.570610 | -0.548389 | 0.499496 | -0.060271 | 0.210738 | 0.454159 | 0.315377 | ... | 0.027614 | 0.182956 | 0.041282 | 0.196853 | 0.012360 | -0.103109 | 0.145002 | 0.012005 | 0.079653 | 4.783712e-06 |
| AAAAG | -0.424879 | -0.173879 | 1.601030 | 0.393521 | -0.376543 | 0.198026 | 0.068529 | 0.023428 | -0.521934 | -0.191951 | ... | -0.087000 | -0.027370 | -0.045159 | 0.083975 | -0.047376 | -0.036584 | 0.109612 | 0.077264 | 0.242677 | 3.211605e-05 |
| AAAAT | 1.638025 | -0.526015 | 0.963103 | 0.227400 | -0.480507 | 0.580577 | 0.149562 | -0.231371 | -0.251046 | 0.174849 | ... | 0.201388 | -0.116860 | 0.107007 | 0.044851 | 0.242405 | -0.009095 | 0.133066 | 0.010694 | -0.130124 | 4.127115e-07 |
| AAACA | 2.029814 | 0.478617 | 1.541410 | 1.394887 | -0.103456 | 0.165921 | 2.445179 | 0.035222 | -0.144542 | 0.367994 | ... | 0.299338 | -0.400981 | -0.294678 | 1.177327 | 0.726800 | 0.021389 | 0.309360 | 0.199590 | 0.133852 | 9.009335e-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGT | 3.332679 | 1.524216 | -0.882499 | -0.696858 | 0.836397 | -1.156986 | 0.556218 | 0.500014 | 0.156163 | 0.216689 | ... | -0.609086 | 0.306015 | 0.283575 | 0.606585 | 0.218290 | -0.430605 | 1.027785 | 0.702348 | -0.115364 | 4.903661e-07 |
| TTTTA | 2.298655 | -0.360357 | -0.505533 | 1.352307 | 6.025070 | 0.182953 | 0.533421 | -0.309203 | 0.471387 | 0.024318 | ... | -0.550174 | 0.808017 | -0.201772 | 1.759322 | 0.725695 | -0.542566 | 2.092282 | 3.031618 | 0.390967 | 1.815254e-04 |
| TTTTC | 2.684310 | 1.833576 | 0.192108 | -0.550983 | 1.460191 | 0.469683 | 0.416367 | 0.345825 | 1.089355 | -0.056080 | ... | -0.101757 | 0.570731 | 0.027791 | 1.139062 | -0.095064 | -0.590156 | 1.184556 | 0.822767 | -0.252124 | 9.926395e-08 |
| TTTTG | 1.489311 | -0.156888 | -0.669274 | 0.185673 | 1.648772 | 0.578333 | 0.343027 | 0.667288 | 0.128079 | -0.152963 | ... | -0.138289 | 0.307081 | -0.010085 | 0.994944 | 0.112638 | -0.182646 | 0.764041 | 0.776003 | -0.557506 | 2.803548e-09 |
| TTTTT | 2.870706 | -0.490323 | -1.859482 | 0.103815 | 0.915930 | 1.053966 | 0.071009 | 0.573608 | 0.114060 | 0.178955 | ... | -0.906039 | 0.441809 | 0.200234 | 0.844890 | 0.371698 | -0.195801 | 1.786632 | 1.891382 | 0.064394 | 4.002759e-06 |
1024 rows × 22 columns
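Because reinstalling the library erases locally built datasets, it can help to keep a copy of the raw measurements somewhere outside the installation directory so that the dataset can be rebuilt later. The sketch below shows one way to do this with the standard library's csv module. The file name, its location, and the two-row `records` table (the first two rows of the simulated data above) are illustrative assumptions.

```python
import csv
import os
import tempfile

# First two rows of the simulated data, as sequence -> (y, y_var)
records = {"AAAAA": (0.425225, 0.01), "AAAAC": (0.025961, 0.01)}

# Hypothetical backup location outside the library's installation directory
backup_path = os.path.join(tempfile.gettempdir(), "my_landscape_data.csv")

# Write one row per sequence with a header line
with open(backup_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["sequence", "y", "y_var"])
    for seq, (y, y_var) in records.items():
        writer.writerow([seq, y, y_var])

# Read the file back and rebuild the same mapping
with open(backup_path, newline="") as handle:
    restored = {
        row["sequence"]: (float(row["y"]), float(row["y_var"]))
        for row in csv.DictReader(handle)
    }

print(restored == records)  # True
```

With pandas, `data.to_csv(path)` followed by `pd.read_csv(path, index_col=0)` gives the same round trip, and the restored table can be passed to DataSet and built again as in the cells above.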