Datasets

This section describes the built-in datasets used for demonstration throughout the documentation, and shows how to build and store new datasets so that they are available locally through the same interface.

Datasets consist of complete combinatorial landscapes that can be visualized and analyzed, as well as the data from which they were derived. Both the inference of the complete landscape and the calculation of visualization coordinates are precomputed to provide quick access to the various layers of interest.

[1]:

# Import required libraries
import numpy as np
import pandas as pd

from gpmap.datasets import DataSet, list_available_datasets
from gpmap.inference import VCregression

How to load a built-in dataset

The library ships with a series of datasets that are used throughout the documentation to demonstrate its different applications. They are directly accessible after installation, and the list of built-in datasets can be shown as follows:

[2]:

list_available_datasets()

[2]:

['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']

How to access combinatorial landscape values

Any of these datasets can be loaded by name, as illustrated in previous tutorials. Every dataset should contain at least a landscape attribute holding the phenotype associated with each possible genotype:

[3]:

gb1 = DataSet('gb1')
gb1.landscape

[3]:

	y
seq
AAAA	0.296301
AAAC	-2.713474
AAAD	-2.912992
AAAE	-4.548719
AAAF	-3.276738
...	...
YYYS	-4.662925
YYYT	-3.223102
YYYV	-3.001718
YYYW	-4.723318
YYYY	-4.876429

160000 rows × 1 columns

How to access the processed data in experimental datasets

If the landscape was obtained from experimental data, the dataset also has a data attribute that includes the measurement y and, if available, its uncertainty y_var. The data need not include measurements for every possible sequence: in this case, about 10,000 sequences were not experimentally measured.

[4]:

gb1.data

[4]:

	y	y_var
sequence
AAAA	0.460831	0.046009
AAAG	-2.192261	0.255906
AAAH	-4.728306	2.064530
AAAI	-4.338842	2.095252
AAAL	-2.326240	0.087518
...	...	...
YYYS	-5.269987	0.291090
YYYT	-3.821426	0.074489
YYYV	-3.143536	0.074682
YYYW	-4.306581	0.699467
YYYY	-4.429813	0.417405

149361 rows × 2 columns
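To make the relationship between the two tables concrete, the following sketch finds which genotypes of a landscape lack a measurement and lines up measured and inferred values. It uses small hand-made DataFrames as stand-ins for gb1.landscape and gb1.data (the values are illustrative, not taken from the dataset) and relies only on standard pandas operations.

```python
import pandas as pd

# Toy stand-ins with the same layout as the `landscape` and `data` attributes:
# both are indexed by sequence; `data` may lack some genotypes.
landscape = pd.DataFrame(
    {"y": [0.30, -2.71, -2.91, -4.55]},
    index=pd.Index(["AA", "AC", "CA", "CC"], name="seq"),
)
data = pd.DataFrame(
    {"y": [0.46, -2.19, -4.43], "y_var": [0.05, 0.26, 0.42]},
    index=pd.Index(["AA", "AC", "CC"], name="sequence"),
)

# Genotypes present in the complete landscape but never measured
missing = landscape.index.difference(data.index)
n_missing = len(missing)

# Measured values next to the inferred landscape values for the same sequences
comparison = data.join(landscape["y"].rename("y_inferred"), how="inner")
print(n_missing, list(missing))
print(comparison)
```

Applied to the tables shown above, the same index difference would account for the gap between the 160000 rows of the landscape and the 149361 rows of the data.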

How to access a dataset visualization

For built-in datasets, we also provide the precomputed visualization coordinates, the DataFrame of edges connecting sequences separated by single point mutations, and the relaxation times associated with each of the diffusion axes, in the attributes nodes, edges and relaxation_times, respectively.

[5]:

gb1.nodes

[5]:

	1	2	3	4	5	6	7	8	9	10	function	stationary_freq
AAAA	-0.270938	-0.944304	-0.227171	0.744803	0.059077	-0.077512	-0.477853	0.174491	0.015944	0.052664	0.296301	1.067767e-04
AAAC	0.033789	-0.232603	-0.271458	0.576487	0.035619	0.087608	0.590118	-0.249005	-0.087750	-0.110291	-2.713474	4.954648e-06
AAAD	-0.020398	-0.127749	-0.174455	0.347843	0.142684	0.208679	0.590025	0.160819	0.354397	0.676487	-2.912992	4.042194e-06
AAAE	-0.001018	-0.138712	-0.183161	0.340728	0.121067	0.157871	0.436407	0.195630	0.211100	0.364298	-4.548719	7.619345e-07
AAAF	0.149717	-0.156524	-0.239304	0.386243	0.103285	0.107756	0.302406	0.051575	0.171278	0.226772	-3.276738	2.789084e-06
...	...	...	...	...	...	...	...	...	...	...	...	...
YYYS	0.073880	0.038075	-0.097751	0.156184	0.056463	0.074291	0.262512	0.019037	0.144439	0.172686	-4.662925	6.781399e-07
YYYT	-0.091125	0.213370	0.256403	0.246274	-0.086279	0.026923	0.217086	0.102111	0.593257	0.003682	-3.223102	2.945947e-06
YYYV	0.016488	0.195242	0.216320	0.035269	0.306726	0.025334	0.217759	-0.028542	-0.038378	0.148356	-3.001718	3.692393e-06
YYYW	0.134072	0.114107	-0.043856	0.011092	0.076565	0.109907	0.261274	0.108909	0.180371	0.365348	-4.723318	6.376209e-07
YYYY	0.086278	0.113188	-0.035062	0.017597	0.096670	0.249598	0.261107	0.108695	0.188331	0.368750	-4.876429	5.454157e-07

160000 rows × 12 columns
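As a rough illustration of how such a nodes table might be used, the sketch below selects the genotypes with the highest stationary frequency and extracts their first two coordinates, for instance to highlight them in a scatter plot. The DataFrame is a hand-made toy, and the assumption that the coordinate columns are labeled with the strings "1" and "2" should be checked against the real table (they could equally be integers).

```python
import pandas as pd

# Toy nodes table mimicking the layout above (values are made up)
nodes = pd.DataFrame(
    {
        "1": [-0.27, 0.03, -0.02, 0.15],
        "2": [-0.94, -0.23, -0.13, -0.16],
        "function": [0.30, -2.71, -2.91, -3.28],
        "stationary_freq": [1.1e-4, 5.0e-6, 4.0e-6, 2.8e-6],
    },
    index=["AAAA", "AAAC", "AAAD", "AAAF"],
)

# Genotypes where the evolutionary random walk spends most of its time
top = nodes.nlargest(2, "stationary_freq")

# Their coordinates along the first two diffusion axes
top_xy = top[["1", "2"]]
print(top_xy)
```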

[6]:

gb1.edges

[6]:

	i	j
0	0	1
1	0	2
2	0	3
3	0	4
4	0	5
...	...	...
6079995	159996	159998
6079996	159996	159999
6079997	159997	159998
6079998	159997	159999
6079999	159998	159999

6080000 rows × 2 columns
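The number of rows in the edges table can be checked by hand: each of the a^L genotypes has L(a - 1) single-mutation neighbors, and each edge is listed once, giving a^L · L(a - 1) / 2 edges. For four positions over 20 amino acids this is 160000 · 4 · 19 / 2 = 6080000. The sketch below verifies the formula by brute force on a small alphabet. It is not the library's own construction; it assumes sequences are indexed in lexicographic order and edges are stored once with i < j, which is consistent with the output shown.

```python
from itertools import product


def count_edges(alphabet_size, seq_length):
    """Number of undirected single-point-mutation edges in a complete landscape."""
    n_genotypes = alphabet_size ** seq_length
    return n_genotypes * seq_length * (alphabet_size - 1) // 2


def enumerate_edges(alphabet, seq_length):
    """List (i, j) index pairs, i < j, of sequences differing at one position."""
    seqs = ["".join(s) for s in product(alphabet, repeat=seq_length)]
    index = {s: i for i, s in enumerate(seqs)}
    edges = []
    for s in seqs:
        for pos in range(seq_length):
            for allele in alphabet:
                if allele == s[pos]:
                    continue
                neighbor = s[:pos] + allele + s[pos + 1:]
                # Keep each undirected edge only once
                if index[s] < index[neighbor]:
                    edges.append((index[s], index[neighbor]))
    return sorted(edges)


toy_edges = enumerate_edges("ACGT", 2)
print(len(toy_edges), count_edges(4, 2))
print(count_edges(20, 4))
```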

[7]:

gb1.relaxation_times

[7]:

	k	decay_rates	relaxation_time
0	1	2.554843	0.391413
1	2	3.566862	0.280359
2	3	4.926568	0.202981
3	4	5.023657	0.199058
4	5	5.303026	0.188572
5	6	5.635594	0.177444
6	7	6.294868	0.158860
7	8	6.543588	0.152821
8	9	6.741685	0.148331
9	10	7.000798	0.142841
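In the table above, each relaxation time is the reciprocal of the corresponding decay rate (for example, 1 / 2.554843 ≈ 0.391413). The short check below confirms this for all ten rows, using the values copied from the output.

```python
import numpy as np

# Values copied from the relaxation_times table above
decay_rates = np.array([
    2.554843, 3.566862, 4.926568, 5.023657, 5.303026,
    5.635594, 6.294868, 6.543588, 6.741685, 7.000798,
])
relaxation_times = np.array([
    0.391413, 0.280359, 0.202981, 0.199058, 0.188572,
    0.177444, 0.158860, 0.152821, 0.148331, 0.142841,
])

# relaxation_time = 1 / decay_rate, up to the rounding of the printed values
reciprocal_ok = np.allclose(1.0 / decay_rates, relaxation_times, atol=1e-5)
print(reciprocal_ok)
```

Slower-decaying axes (smaller decay rate) therefore have longer relaxation times, which is why the first axes capture the most persistent features of the landscape.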

How to build new datasets

We also provide functionality to create new datasets and store them in the local copy of the library for easier access. Let's build a new dataset from simulated data.

[8]:

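# Fix the random seed so that the simulation is reproducible
np.random.seed(0)

# One variance component per interaction order, from 0 up to seq_length
lambdas = np.array([10, 2, 0.5, 0.1, 0.02, 0])
model = VCregression(seq_length=5, alphabet_type='dna', lambdas=lambdas)

# Simulate measurements with variance 0.01, leaving about 20% of sequences unmeasured
f, X, y, y_var = model.simulate(p_missing=0.2, y_var=0.01)
data = pd.DataFrame({'y': y, 'y_var': y_var}, index=X)
data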

[8]:

	y	y_var
AAAAA	0.425225	0.01
AAAAC	0.025961	0.01
AAAAG	0.211691	0.01
AAAAT	-0.169990	0.01
AAACA	0.141775	0.01
...	...	...
TTTGA	0.080597	0.01
TTTGC	-0.092545	0.01
TTTGG	-0.509507	0.01
TTTTA	0.510968	0.01
TTTTG	-0.646299	0.01

816 rows × 2 columns
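For real experiments, a table in the same format (sequences as index, columns y and y_var) has to be assembled from the raw measurements. The sketch below shows one possible way to do this from replicate measurements with pandas. The input column names sequence and value are hypothetical, and taking y_var as the variance of the replicate mean is an assumption; the appropriate uncertainty estimate depends on the experimental design.

```python
import pandas as pd

# Hypothetical replicate measurements in long format (made-up values)
replicates = pd.DataFrame({
    "sequence": ["AAAAA", "AAAAA", "AAAAA", "AAAAC", "AAAAC", "AAAAC"],
    "value": [0.40, 0.45, 0.41, 0.00, 0.05, 0.02],
})

grouped = replicates.groupby("sequence")["value"]

# y: mean across replicates
# y_var: variance of that mean (sample variance / number of replicates);
# this definition is an assumption, not something prescribed by the library
data = pd.DataFrame({
    "y": grouped.mean(),
    "y_var": grouped.var(ddof=1) / grouped.count(),
})
print(data)
```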

The build method runs Variance Component regression and computes the visualization coordinates automatically using default values, which may not be the best choice for every dataset.

[9]:

test = DataSet('test', data=data)
test.build()

100%|██████████| 100/100 [00:02<00:00, 42.60it/s]

We can now reload the dataset from disk and verify that it contains the visualization attributes.

Note that reinstalling the library will erase any newly created datasets.

[10]:

test = DataSet('test')
test.nodes

[10]:

	1	2	3	4	5	6	7	8	9	10	...	13	14	15	16	17	18	19	20	function	stationary_freq
AAAAA	1.179083	-0.268895	3.019227	0.931148	-0.423071	0.313136	0.423140	-0.123224	-0.889849	0.905356	...	0.049589	-0.813528	-0.733640	0.179086	0.185295	0.014557	0.233692	0.093785	0.387054	1.734146e-04
AAAAC	1.072408	0.056435	2.945652	0.570610	-0.548389	0.499496	-0.060271	0.210738	0.454159	0.315377	...	0.027614	0.182956	0.041282	0.196853	0.012360	-0.103109	0.145002	0.012005	0.079653	4.783712e-06
AAAAG	-0.424879	-0.173879	1.601030	0.393521	-0.376543	0.198026	0.068529	0.023428	-0.521934	-0.191951	...	-0.087000	-0.027370	-0.045159	0.083975	-0.047376	-0.036584	0.109612	0.077264	0.242677	3.211605e-05
AAAAT	1.638025	-0.526015	0.963103	0.227400	-0.480507	0.580577	0.149562	-0.231371	-0.251046	0.174849	...	0.201388	-0.116860	0.107007	0.044851	0.242405	-0.009095	0.133066	0.010694	-0.130124	4.127115e-07
AAACA	2.029814	0.478617	1.541410	1.394887	-0.103456	0.165921	2.445179	0.035222	-0.144542	0.367994	...	0.299338	-0.400981	-0.294678	1.177327	0.726800	0.021389	0.309360	0.199590	0.133852	9.009335e-06
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTTGT	3.332679	1.524216	-0.882499	-0.696858	0.836397	-1.156986	0.556218	0.500014	0.156163	0.216689	...	-0.609086	0.306015	0.283575	0.606585	0.218290	-0.430605	1.027785	0.702348	-0.115364	4.903661e-07
TTTTA	2.298655	-0.360357	-0.505533	1.352307	6.025070	0.182953	0.533421	-0.309203	0.471387	0.024318	...	-0.550174	0.808017	-0.201772	1.759322	0.725695	-0.542566	2.092282	3.031618	0.390967	1.815254e-04
TTTTC	2.684310	1.833576	0.192108	-0.550983	1.460191	0.469683	0.416367	0.345825	1.089355	-0.056080	...	-0.101757	0.570731	0.027791	1.139062	-0.095064	-0.590156	1.184556	0.822767	-0.252124	9.926395e-08
TTTTG	1.489311	-0.156888	-0.669274	0.185673	1.648772	0.578333	0.343027	0.667288	0.128079	-0.152963	...	-0.138289	0.307081	-0.010085	0.994944	0.112638	-0.182646	0.764041	0.776003	-0.557506	2.803548e-09
TTTTT	2.870706	-0.490323	-1.859482	0.103815	0.915930	1.053966	0.071009	0.573608	0.114060	0.178955	...	-0.906039	0.441809	0.200234	0.844890	0.371698	-0.195801	1.786632	1.891382	0.064394	4.002759e-06

1024 rows × 22 columns