{ "cells": [ { "cell_type": "markdown", "id": "be5d3b4f", "metadata": {}, "source": [ "# Datasets\n", "\n", "Each dataset bundles a complete combinatorial landscape that can be visualized and analyzed, together with the data from which it was derived. Both the inference of the complete landscape and the calculation of the visualization coordinates are precomputed, so every layer of interest is quickly accessible." ] }, { "cell_type": "code", "execution_count": 1, "id": "6b281ef0", "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from gpmap.datasets import DataSet, list_available_datasets\n", "from gpmap.inference import VCregression" ] }, { "cell_type": "markdown", "id": "c419153e", "metadata": {}, "source": [ "## How to load a built-in dataset\n", "\n", "The library ships with a set of built-in datasets that are used throughout the documentation to demonstrate its different applications. They are available to any user as soon as the library is installed.\n", "The list of built-in datasets can be shown as follows:" ] }, { "cell_type": "code", "execution_count": 2, "id": "a40a04a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_available_datasets()" ] }, { "cell_type": "markdown", "id": "508e43b7", "metadata": {}, "source": [ "### How to access combinatorial landscape values\n", "\n", "Any of these datasets can be loaded as illustrated in previous tutorials. Each of them should contain at least a `landscape` attribute with the phenotype associated with each possible genotype." ] }, { "cell_type": "code", "execution_count": 3, "id": "b5b44e02", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| seq | y |
|---|---|
| AAAA | 0.296301 |
| AAAC | -2.713474 |
| AAAD | -2.912992 |
| AAAE | -4.548719 |
| AAAF | -3.276738 |
| ... | ... |
| YYYS | -4.662925 |
| YYYT | -3.223102 |
| YYYV | -3.001718 |
| YYYW | -4.723318 |
| YYYY | -4.876429 |

160000 rows × 1 columns

| sequence | y | y_var |
|---|---|---|
| AAAA | 0.460831 | 0.046009 |
| AAAG | -2.192261 | 0.255906 |
| AAAH | -4.728306 | 2.064530 |
| AAAI | -4.338842 | 2.095252 |
| AAAL | -2.326240 | 0.087518 |
| ... | ... | ... |
| YYYS | -5.269987 | 0.291090 |
| YYYT | -3.821426 | 0.074489 |
| YYYV | -3.143536 | 0.074682 |
| YYYW | -4.306581 | 0.699467 |
| YYYY | -4.429813 | 0.417405 |

149361 rows × 2 columns

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | function | stationary_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAA | -0.270938 | -0.944304 | -0.227171 | 0.744803 | 0.059077 | -0.077512 | -0.477853 | 0.174491 | 0.015944 | 0.052664 | 0.296301 | 1.067767e-04 |
| AAAC | 0.033789 | -0.232603 | -0.271458 | 0.576487 | 0.035619 | 0.087608 | 0.590118 | -0.249005 | -0.087750 | -0.110291 | -2.713474 | 4.954648e-06 |
| AAAD | -0.020398 | -0.127749 | -0.174455 | 0.347843 | 0.142684 | 0.208679 | 0.590025 | 0.160819 | 0.354397 | 0.676487 | -2.912992 | 4.042194e-06 |
| AAAE | -0.001018 | -0.138712 | -0.183161 | 0.340728 | 0.121067 | 0.157871 | 0.436407 | 0.195630 | 0.211100 | 0.364298 | -4.548719 | 7.619345e-07 |
| AAAF | 0.149717 | -0.156524 | -0.239304 | 0.386243 | 0.103285 | 0.107756 | 0.302406 | 0.051575 | 0.171278 | 0.226772 | -3.276738 | 2.789084e-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| YYYS | 0.073880 | 0.038075 | -0.097751 | 0.156184 | 0.056463 | 0.074291 | 0.262512 | 0.019037 | 0.144439 | 0.172686 | -4.662925 | 6.781399e-07 |
| YYYT | -0.091125 | 0.213370 | 0.256403 | 0.246274 | -0.086279 | 0.026923 | 0.217086 | 0.102111 | 0.593257 | 0.003682 | -3.223102 | 2.945947e-06 |
| YYYV | 0.016488 | 0.195242 | 0.216320 | 0.035269 | 0.306726 | 0.025334 | 0.217759 | -0.028542 | -0.038378 | 0.148356 | -3.001718 | 3.692393e-06 |
| YYYW | 0.134072 | 0.114107 | -0.043856 | 0.011092 | 0.076565 | 0.109907 | 0.261274 | 0.108909 | 0.180371 | 0.365348 | -4.723318 | 6.376209e-07 |
| YYYY | 0.086278 | 0.113188 | -0.035062 | 0.017597 | 0.096670 | 0.249598 | 0.261107 | 0.108695 | 0.188331 | 0.368750 | -4.876429 | 5.454157e-07 |

160000 rows × 12 columns

|  | i | j |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 0 | 5 |
| ... | ... | ... |
| 6079995 | 159996 | 159998 |
| 6079996 | 159996 | 159999 |
| 6079997 | 159997 | 159998 |
| 6079998 | 159997 | 159999 |
| 6079999 | 159998 | 159999 |

6080000 rows × 2 columns

|  | k | decay_rates | relaxation_time |
|---|---|---|---|
| 0 | 1 | 2.554843 | 0.391413 |
| 1 | 2 | 3.566862 | 0.280359 |
| 2 | 3 | 4.926568 | 0.202981 |
| 3 | 4 | 5.023657 | 0.199058 |
| 4 | 5 | 5.303026 | 0.188572 |
| 5 | 6 | 5.635594 | 0.177444 |
| 6 | 7 | 6.294868 | 0.158860 |
| 7 | 8 | 6.543588 | 0.152821 |
| 8 | 9 | 6.741685 | 0.148331 |
| 9 | 10 | 7.000798 | 0.142841 |

|  | y | y_var |
|---|---|---|
| AAAAA | 0.425225 | 0.01 |
| AAAAC | 0.025961 | 0.01 |
| AAAAG | 0.211691 | 0.01 |
| AAAAT | -0.169990 | 0.01 |
| AAACA | 0.141775 | 0.01 |
| ... | ... | ... |
| TTTGA | 0.080597 | 0.01 |
| TTTGC | -0.092545 | 0.01 |
| TTTGG | -0.509507 | 0.01 |
| TTTTA | 0.510968 | 0.01 |
| TTTTG | -0.646299 | 0.01 |

816 rows × 2 columns

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | function | stationary_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAA | 1.179083 | -0.268895 | 3.019227 | 0.931148 | -0.423071 | 0.313136 | 0.423140 | -0.123224 | -0.889849 | 0.905356 | ... | 0.049589 | -0.813528 | -0.733640 | 0.179086 | 0.185295 | 0.014557 | 0.233692 | 0.093785 | 0.387054 | 1.734146e-04 |
| AAAAC | 1.072408 | 0.056435 | 2.945652 | 0.570610 | -0.548389 | 0.499496 | -0.060271 | 0.210738 | 0.454159 | 0.315377 | ... | 0.027614 | 0.182956 | 0.041282 | 0.196853 | 0.012360 | -0.103109 | 0.145002 | 0.012005 | 0.079653 | 4.783712e-06 |
| AAAAG | -0.424879 | -0.173879 | 1.601030 | 0.393521 | -0.376543 | 0.198026 | 0.068529 | 0.023428 | -0.521934 | -0.191951 | ... | -0.087000 | -0.027370 | -0.045159 | 0.083975 | -0.047376 | -0.036584 | 0.109612 | 0.077264 | 0.242677 | 3.211605e-05 |
| AAAAT | 1.638025 | -0.526015 | 0.963103 | 0.227400 | -0.480507 | 0.580577 | 0.149562 | -0.231371 | -0.251046 | 0.174849 | ... | 0.201388 | -0.116860 | 0.107007 | 0.044851 | 0.242405 | -0.009095 | 0.133066 | 0.010694 | -0.130124 | 4.127115e-07 |
| AAACA | 2.029814 | 0.478617 | 1.541410 | 1.394887 | -0.103456 | 0.165921 | 2.445179 | 0.035222 | -0.144542 | 0.367994 | ... | 0.299338 | -0.400981 | -0.294678 | 1.177327 | 0.726800 | 0.021389 | 0.309360 | 0.199590 | 0.133852 | 9.009335e-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGT | 3.332679 | 1.524216 | -0.882499 | -0.696858 | 0.836397 | -1.156986 | 0.556218 | 0.500014 | 0.156163 | 0.216689 | ... | -0.609086 | 0.306015 | 0.283575 | 0.606585 | 0.218290 | -0.430605 | 1.027785 | 0.702348 | -0.115364 | 4.903661e-07 |
| TTTTA | 2.298655 | -0.360357 | -0.505533 | 1.352307 | 6.025070 | 0.182953 | 0.533421 | -0.309203 | 0.471387 | 0.024318 | ... | -0.550174 | 0.808017 | -0.201772 | 1.759322 | 0.725695 | -0.542566 | 2.092282 | 3.031618 | 0.390967 | 1.815254e-04 |
| TTTTC | 2.684310 | 1.833576 | 0.192108 | -0.550983 | 1.460191 | 0.469683 | 0.416367 | 0.345825 | 1.089355 | -0.056080 | ... | -0.101757 | 0.570731 | 0.027791 | 1.139062 | -0.095064 | -0.590156 | 1.184556 | 0.822767 | -0.252124 | 9.926395e-08 |
| TTTTG | 1.489311 | -0.156888 | -0.669274 | 0.185673 | 1.648772 | 0.578333 | 0.343027 | 0.667288 | 0.128079 | -0.152963 | ... | -0.138289 | 0.307081 | -0.010085 | 0.994944 | 0.112638 | -0.182646 | 0.764041 | 0.776003 | -0.557506 | 2.803548e-09 |
| TTTTT | 2.870706 | -0.490323 | -1.859482 | 0.103815 | 0.915930 | 1.053966 | 0.071009 | 0.573608 | 0.114060 | 0.178955 | ... | -0.906039 | 0.441809 | 0.200234 | 0.844890 | 0.371698 | -0.195801 | 1.786632 | 1.891382 | 0.064394 | 4.002759e-06 |

1024 rows × 22 columns
\n", "