{ "cells": [ { "cell_type": "markdown", "id": "be5d3b4f", "metadata": {}, "source": [ "# Datasets\n", "\n", "In this section, we present the built-in datasets used to demonstrate the library throughout the documentation, and show how to build and store new datasets so that they are available locally through the same interface.\n", "\n", "Datasets consist of complete combinatorial landscapes that can be visualized and analyzed, together with the data from which they were derived. Both the inference of the complete landscape and the calculation of visualization coordinates are precomputed to provide quick access to the various layers of interest." ] }, { "cell_type": "code", "execution_count": 1, "id": "6b281ef0", "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from gpmap.datasets import DataSet, list_available_datasets\n", "from gpmap.inference import VCregression" ] }, { "cell_type": "markdown", "id": "c419153e", "metadata": {}, "source": [ "## How to load a built-in dataset\n", "\n", "The library includes a series of datasets that are used throughout the documentation to demonstrate its different applications; they are directly accessible to any user after installation.\n", "The built-in datasets can be listed as follows:" ] }, { "cell_type": "code", "execution_count": 2, "id": "a40a04a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_available_datasets()" ] }, { "cell_type": "markdown", "id": "508e43b7", "metadata": {}, "source": [ "### How to access combinatorial landscape values\n", "\n", "Any of these datasets can be loaded as illustrated in previous tutorials. Each one should contain at least a `landscape` attribute holding the phenotype associated with each possible genotype." ] }, { "cell_type": "code", "execution_count": 3, "id": "b5b44e02", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | y | \n", "
|---|---|
| seq | \n", "\n", " |
| AAAA | \n", "0.296301 | \n", "
| AAAC | \n", "-2.713474 | \n", "
| AAAD | \n", "-2.912992 | \n", "
| AAAE | \n", "-4.548719 | \n", "
| AAAF | \n", "-3.276738 | \n", "
| ... | \n", "... | \n", "
| YYYS | \n", "-4.662925 | \n", "
| YYYT | \n", "-3.223102 | \n", "
| YYYV | \n", "-3.001718 | \n", "
| YYYW | \n", "-4.723318 | \n", "
| YYYY | \n", "-4.876429 | \n", "
160000 rows × 1 columns
\n", "| \n", " | y | \n", "y_var | \n", "
|---|---|---|
| sequence | \n", "\n", " | \n", " |
| AAAA | \n", "0.460831 | \n", "0.046009 | \n", "
| AAAG | \n", "-2.192261 | \n", "0.255906 | \n", "
| AAAH | \n", "-4.728306 | \n", "2.064530 | \n", "
| AAAI | \n", "-4.338842 | \n", "2.095252 | \n", "
| AAAL | \n", "-2.326240 | \n", "0.087518 | \n", "
| ... | \n", "... | \n", "... | \n", "
| YYYS | \n", "-5.269987 | \n", "0.291090 | \n", "
| YYYT | \n", "-3.821426 | \n", "0.074489 | \n", "
| YYYV | \n", "-3.143536 | \n", "0.074682 | \n", "
| YYYW | \n", "-4.306581 | \n", "0.699467 | \n", "
| YYYY | \n", "-4.429813 | \n", "0.417405 | \n", "
149361 rows × 2 columns
\n", "| \n", " | 1 | \n", "2 | \n", "3 | \n", "4 | \n", "5 | \n", "6 | \n", "7 | \n", "8 | \n", "9 | \n", "10 | \n", "function | \n", "stationary_freq | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAA | \n", "-0.270938 | \n", "-0.944304 | \n", "-0.227171 | \n", "0.744803 | \n", "0.059077 | \n", "-0.077512 | \n", "-0.477853 | \n", "0.174491 | \n", "0.015944 | \n", "0.052664 | \n", "0.296301 | \n", "1.067767e-04 | \n", "
| AAAC | \n", "0.033789 | \n", "-0.232603 | \n", "-0.271458 | \n", "0.576487 | \n", "0.035619 | \n", "0.087608 | \n", "0.590118 | \n", "-0.249005 | \n", "-0.087750 | \n", "-0.110291 | \n", "-2.713474 | \n", "4.954648e-06 | \n", "
| AAAD | \n", "-0.020398 | \n", "-0.127749 | \n", "-0.174455 | \n", "0.347843 | \n", "0.142684 | \n", "0.208679 | \n", "0.590025 | \n", "0.160819 | \n", "0.354397 | \n", "0.676487 | \n", "-2.912992 | \n", "4.042194e-06 | \n", "
| AAAE | \n", "-0.001018 | \n", "-0.138712 | \n", "-0.183161 | \n", "0.340728 | \n", "0.121067 | \n", "0.157871 | \n", "0.436407 | \n", "0.195630 | \n", "0.211100 | \n", "0.364298 | \n", "-4.548719 | \n", "7.619345e-07 | \n", "
| AAAF | \n", "0.149717 | \n", "-0.156524 | \n", "-0.239304 | \n", "0.386243 | \n", "0.103285 | \n", "0.107756 | \n", "0.302406 | \n", "0.051575 | \n", "0.171278 | \n", "0.226772 | \n", "-3.276738 | \n", "2.789084e-06 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| YYYS | \n", "0.073880 | \n", "0.038075 | \n", "-0.097751 | \n", "0.156184 | \n", "0.056463 | \n", "0.074291 | \n", "0.262512 | \n", "0.019037 | \n", "0.144439 | \n", "0.172686 | \n", "-4.662925 | \n", "6.781399e-07 | \n", "
| YYYT | \n", "-0.091125 | \n", "0.213370 | \n", "0.256403 | \n", "0.246274 | \n", "-0.086279 | \n", "0.026923 | \n", "0.217086 | \n", "0.102111 | \n", "0.593257 | \n", "0.003682 | \n", "-3.223102 | \n", "2.945947e-06 | \n", "
| YYYV | \n", "0.016488 | \n", "0.195242 | \n", "0.216320 | \n", "0.035269 | \n", "0.306726 | \n", "0.025334 | \n", "0.217759 | \n", "-0.028542 | \n", "-0.038378 | \n", "0.148356 | \n", "-3.001718 | \n", "3.692393e-06 | \n", "
| YYYW | \n", "0.134072 | \n", "0.114107 | \n", "-0.043856 | \n", "0.011092 | \n", "0.076565 | \n", "0.109907 | \n", "0.261274 | \n", "0.108909 | \n", "0.180371 | \n", "0.365348 | \n", "-4.723318 | \n", "6.376209e-07 | \n", "
| YYYY | \n", "0.086278 | \n", "0.113188 | \n", "-0.035062 | \n", "0.017597 | \n", "0.096670 | \n", "0.249598 | \n", "0.261107 | \n", "0.108695 | \n", "0.188331 | \n", "0.368750 | \n", "-4.876429 | \n", "5.454157e-07 | \n", "
160000 rows × 12 columns
\n", "| \n", " | i | \n", "j | \n", "
|---|---|---|
| 0 | \n", "0 | \n", "1 | \n", "
| 1 | \n", "0 | \n", "2 | \n", "
| 2 | \n", "0 | \n", "3 | \n", "
| 3 | \n", "0 | \n", "4 | \n", "
| 4 | \n", "0 | \n", "5 | \n", "
| ... | \n", "... | \n", "... | \n", "
| 6079995 | \n", "159996 | \n", "159998 | \n", "
| 6079996 | \n", "159996 | \n", "159999 | \n", "
| 6079997 | \n", "159997 | \n", "159998 | \n", "
| 6079998 | \n", "159997 | \n", "159999 | \n", "
| 6079999 | \n", "159998 | \n", "159999 | \n", "
6080000 rows × 2 columns
\n", "| \n", " | k | \n", "decay_rates | \n", "relaxation_time | \n", "
|---|---|---|---|
| 0 | \n", "1 | \n", "2.554843 | \n", "0.391413 | \n", "
| 1 | \n", "2 | \n", "3.566862 | \n", "0.280359 | \n", "
| 2 | \n", "3 | \n", "4.926568 | \n", "0.202981 | \n", "
| 3 | \n", "4 | \n", "5.023657 | \n", "0.199058 | \n", "
| 4 | \n", "5 | \n", "5.303026 | \n", "0.188572 | \n", "
| 5 | \n", "6 | \n", "5.635594 | \n", "0.177444 | \n", "
| 6 | \n", "7 | \n", "6.294868 | \n", "0.158860 | \n", "
| 7 | \n", "8 | \n", "6.543588 | \n", "0.152821 | \n", "
| 8 | \n", "9 | \n", "6.741685 | \n", "0.148331 | \n", "
| 9 | \n", "10 | \n", "7.000798 | \n", "0.142841 | \n", "
| \n", " | y | \n", "y_var | \n", "
|---|---|---|
| AAAAA | \n", "0.425225 | \n", "0.01 | \n", "
| AAAAC | \n", "0.025961 | \n", "0.01 | \n", "
| AAAAG | \n", "0.211691 | \n", "0.01 | \n", "
| AAAAT | \n", "-0.169990 | \n", "0.01 | \n", "
| AAACA | \n", "0.141775 | \n", "0.01 | \n", "
| ... | \n", "... | \n", "... | \n", "
| TTTGA | \n", "0.080597 | \n", "0.01 | \n", "
| TTTGC | \n", "-0.092545 | \n", "0.01 | \n", "
| TTTGG | \n", "-0.509507 | \n", "0.01 | \n", "
| TTTTA | \n", "0.510968 | \n", "0.01 | \n", "
| TTTTG | \n", "-0.646299 | \n", "0.01 | \n", "
816 rows × 2 columns
\n", "| \n", " | 1 | \n", "2 | \n", "3 | \n", "4 | \n", "5 | \n", "6 | \n", "7 | \n", "8 | \n", "9 | \n", "10 | \n", "... | \n", "13 | \n", "14 | \n", "15 | \n", "16 | \n", "17 | \n", "18 | \n", "19 | \n", "20 | \n", "function | \n", "stationary_freq | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAA | \n", "1.179083 | \n", "-0.268895 | \n", "3.019227 | \n", "0.931148 | \n", "-0.423071 | \n", "0.313136 | \n", "0.423140 | \n", "-0.123224 | \n", "-0.889849 | \n", "0.905356 | \n", "... | \n", "0.049589 | \n", "-0.813528 | \n", "-0.733640 | \n", "0.179086 | \n", "0.185295 | \n", "0.014557 | \n", "0.233692 | \n", "0.093785 | \n", "0.387054 | \n", "1.734146e-04 | \n", "
| AAAAC | \n", "1.072408 | \n", "0.056435 | \n", "2.945652 | \n", "0.570610 | \n", "-0.548389 | \n", "0.499496 | \n", "-0.060271 | \n", "0.210738 | \n", "0.454159 | \n", "0.315377 | \n", "... | \n", "0.027614 | \n", "0.182956 | \n", "0.041282 | \n", "0.196853 | \n", "0.012360 | \n", "-0.103109 | \n", "0.145002 | \n", "0.012005 | \n", "0.079653 | \n", "4.783712e-06 | \n", "
| AAAAG | \n", "-0.424879 | \n", "-0.173879 | \n", "1.601030 | \n", "0.393521 | \n", "-0.376543 | \n", "0.198026 | \n", "0.068529 | \n", "0.023428 | \n", "-0.521934 | \n", "-0.191951 | \n", "... | \n", "-0.087000 | \n", "-0.027370 | \n", "-0.045159 | \n", "0.083975 | \n", "-0.047376 | \n", "-0.036584 | \n", "0.109612 | \n", "0.077264 | \n", "0.242677 | \n", "3.211605e-05 | \n", "
| AAAAT | \n", "1.638025 | \n", "-0.526015 | \n", "0.963103 | \n", "0.227400 | \n", "-0.480507 | \n", "0.580577 | \n", "0.149562 | \n", "-0.231371 | \n", "-0.251046 | \n", "0.174849 | \n", "... | \n", "0.201388 | \n", "-0.116860 | \n", "0.107007 | \n", "0.044851 | \n", "0.242405 | \n", "-0.009095 | \n", "0.133066 | \n", "0.010694 | \n", "-0.130124 | \n", "4.127115e-07 | \n", "
| AAACA | \n", "2.029814 | \n", "0.478617 | \n", "1.541410 | \n", "1.394887 | \n", "-0.103456 | \n", "0.165921 | \n", "2.445179 | \n", "0.035222 | \n", "-0.144542 | \n", "0.367994 | \n", "... | \n", "0.299338 | \n", "-0.400981 | \n", "-0.294678 | \n", "1.177327 | \n", "0.726800 | \n", "0.021389 | \n", "0.309360 | \n", "0.199590 | \n", "0.133852 | \n", "9.009335e-06 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| TTTGT | \n", "3.332679 | \n", "1.524216 | \n", "-0.882499 | \n", "-0.696858 | \n", "0.836397 | \n", "-1.156986 | \n", "0.556218 | \n", "0.500014 | \n", "0.156163 | \n", "0.216689 | \n", "... | \n", "-0.609086 | \n", "0.306015 | \n", "0.283575 | \n", "0.606585 | \n", "0.218290 | \n", "-0.430605 | \n", "1.027785 | \n", "0.702348 | \n", "-0.115364 | \n", "4.903661e-07 | \n", "
| TTTTA | \n", "2.298655 | \n", "-0.360357 | \n", "-0.505533 | \n", "1.352307 | \n", "6.025070 | \n", "0.182953 | \n", "0.533421 | \n", "-0.309203 | \n", "0.471387 | \n", "0.024318 | \n", "... | \n", "-0.550174 | \n", "0.808017 | \n", "-0.201772 | \n", "1.759322 | \n", "0.725695 | \n", "-0.542566 | \n", "2.092282 | \n", "3.031618 | \n", "0.390967 | \n", "1.815254e-04 | \n", "
| TTTTC | \n", "2.684310 | \n", "1.833576 | \n", "0.192108 | \n", "-0.550983 | \n", "1.460191 | \n", "0.469683 | \n", "0.416367 | \n", "0.345825 | \n", "1.089355 | \n", "-0.056080 | \n", "... | \n", "-0.101757 | \n", "0.570731 | \n", "0.027791 | \n", "1.139062 | \n", "-0.095064 | \n", "-0.590156 | \n", "1.184556 | \n", "0.822767 | \n", "-0.252124 | \n", "9.926395e-08 | \n", "
| TTTTG | \n", "1.489311 | \n", "-0.156888 | \n", "-0.669274 | \n", "0.185673 | \n", "1.648772 | \n", "0.578333 | \n", "0.343027 | \n", "0.667288 | \n", "0.128079 | \n", "-0.152963 | \n", "... | \n", "-0.138289 | \n", "0.307081 | \n", "-0.010085 | \n", "0.994944 | \n", "0.112638 | \n", "-0.182646 | \n", "0.764041 | \n", "0.776003 | \n", "-0.557506 | \n", "2.803548e-09 | \n", "
| TTTTT | \n", "2.870706 | \n", "-0.490323 | \n", "-1.859482 | \n", "0.103815 | \n", "0.915930 | \n", "1.053966 | \n", "0.071009 | \n", "0.573608 | \n", "0.114060 | \n", "0.178955 | \n", "... | \n", "-0.906039 | \n", "0.441809 | \n", "0.200234 | \n", "0.844890 | \n", "0.371698 | \n", "-0.195801 | \n", "1.786632 | \n", "1.891382 | \n", "0.064394 | \n", "4.002759e-06 | \n", "
1024 rows × 22 columns
\n", "