=====================================================================================
gpmap-tools: tools for inference and visualization of complex genotype-phenotype maps
=====================================================================================

``gpmap-tools`` was created to integrate and standardize methods developed in the `McCandlish Lab `_ for the inference, analysis, and visualization of complex genotype-phenotype maps comprising up to millions of genotypes.

For example, this is a low-dimensional representation of a 12-nucleotide genotype-phenotype map comprising 16 million genotypes, where fitness depends on the 4-amino acid sequence they encode under the standard genetic code. This visualization illustrates how the vast set of functional sequences is accessible to one another under a given evolutionary model and highlights the long, winding mutational paths required to evolve sequences differing at only a few amino acid positions.

.. image:: figures/visualization.png
   :width: 80%

Inference
=========

``gpmap-tools`` implements a suite of previously proposed Gaussian process models for the inference and analysis of complete genotype-phenotype maps.

- `Zhou J and McCandlish DM. Minimum epistasis interpolation for sequence-function relationships (2020) `_
- `Zhou J, Wong MS, Chen WC, Krainer AR, Kinney JB, McCandlish DM. Higher order epistasis and phenotypic prediction (2022) `_
- `Chen WC, Zhou J, Sheltzer JM, Kinney JB, McCandlish DM. Field theoretic density estimation for biological sequence space with applications to 5' splice site diversity and aneuploidy in cancer (2021) `_

These models are implemented within a unified framework inspired by the classical machine learning library `scikit-learn `_, defined as classes with the common methods ``fit`` and ``predict`` to infer hyperparameters and make phenotypic predictions, respectively.

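
The ``fit``/``predict`` pattern described above can be illustrated with a small toy model. The sketch below is not the ``gpmap-tools`` API: the class ``ToyHammingGP``, its parameters ``decay`` and ``noise``, and the Hamming-distance kernel are all hypothetical names and choices made for illustration. For brevity, the hyperparameters are fixed by hand here instead of being inferred in ``fit``.

.. code-block:: python

   import itertools

   import numpy as np


   class ToyHammingGP:
       """Hypothetical toy Gaussian process; NOT part of gpmap-tools."""

       def __init__(self, decay=0.5, noise=0.1):
           self.decay = decay  # assumed prior correlation retained per mismatch
           self.noise = noise  # assumed measurement noise variance

       def _kernel(self, seqs_a, seqs_b):
           # Prior covariance decay ** (Hamming distance), with unit variance.
           a = np.array([list(s) for s in seqs_a])
           b = np.array([list(s) for s in seqs_b])
           hamming = (a[:, None, :] != b[None, :, :]).sum(axis=2)
           return self.decay ** hamming

       def fit(self, seqs, y):
           # Condition on the observed phenotypes (hyperparameters stay fixed).
           self.seqs_ = list(seqs)
           y = np.asarray(y, dtype=float)
           self.mean_ = y.mean()
           K = self._kernel(self.seqs_, self.seqs_)
           self.K_noisy_ = K + self.noise * np.eye(len(self.seqs_))
           self.alpha_ = np.linalg.solve(self.K_noisy_, y - self.mean_)
           return self

       def predict(self, seqs):
           # Posterior mean and variance of the phenotype for each sequence.
           K_star = self._kernel(seqs, self.seqs_)
           mean = self.mean_ + K_star @ self.alpha_
           solved = np.linalg.solve(self.K_noisy_, K_star.T)
           var = 1.0 - np.sum(K_star * solved.T, axis=1)
           return mean, var


   # Complete map over 3 nucleotides (4**3 = 64 genotypes); observe every other one.
   genotypes = ["".join(g) for g in itertools.product("ACGT", repeat=3)]
   phenotypes = np.array([g.count("A") for g in genotypes], dtype=float)
   observed_mask = np.arange(len(genotypes)) % 2 == 0
   train_seqs = [g for g, keep in zip(genotypes, observed_mask) if keep]

   model = ToyHammingGP(decay=0.5, noise=0.1).fit(train_seqs, phenotypes[observed_mask])
   mean, var = model.predict(genotypes)

In this toy setting the posterior variance is smaller for measured genotypes than for unmeasured ones, which is the kind of uncertainty estimate the predictions listed in this section refer to.
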
Among other things, ``gpmap-tools`` allows you to:

- Estimate the magnitude of genetic interactions of different orders from experimental measurements.
- Infer complete combinatorial genotype-phenotype maps containing millions of sequences from experimental measurements and observations of natural sequences.
- Predict the phenotypes of unobserved genotypes with associated uncertainty.
- Estimate the effects of mutations in specific and possibly unobserved genetic backgrounds with associated uncertainty.
- Compute the variance explained by interactions of different orders involving combinations of sites in a complete genotype-phenotype map.

Visualization
=============

``gpmap-tools`` also provides a new and accessible implementation of a powerful, previously described method to visualize complex genotype-phenotype maps.

- `McCandlish DM. Visualizing fitness landscapes (2011) `_

More specifically, ``gpmap-tools`` allows you to:

- Compute the coordinates of the low-dimensional representation of a genotype-phenotype map.
- Write and read intermediate files in the efficient ``parquet`` and ``npz`` formats.
- Plot the low-dimensional representation using different backends with varying degrees of computational speed and flexibility. For example, using `datashader `_, we can efficiently render visualizations of genotype-phenotype maps with millions of genotypes.
- Identify the sequence features that characterize different regions of the genotype-phenotype map with additional plotting functionalities.

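
To give a sense of what computing such coordinates can involve, here is a minimal sketch on a tiny map. It rests on assumptions that this page does not spell out: evolution is modeled as a random walk between single-mutant neighbors with rate ``s / (1 - exp(-s))`` for a scaled fitness difference ``s``, and the coordinates are the slowest-decaying right eigenvectors of the rate matrix, each divided by the square root of its decay rate. The function name and its parameters are hypothetical and do not belong to ``gpmap-tools``.

.. code-block:: python

   import itertools

   import numpy as np


   def toy_visualization_coordinates(genotypes, fitness, scale=1.0, n_dims=2):
       """Hypothetical sketch; NOT the gpmap-tools implementation."""
       n = len(genotypes)
       f = np.asarray(fitness, dtype=float)

       # Rate matrix of the assumed random walk between single-mutant neighbors.
       Q = np.zeros((n, n))
       for i, j in itertools.permutations(range(n), 2):
           if sum(a != b for a, b in zip(genotypes[i], genotypes[j])) == 1:
               s = scale * (f[j] - f[i])
               Q[i, j] = 1.0 if abs(s) < 1e-12 else s / -np.expm1(-s)
       Q[np.diag_indices(n)] = -Q.sum(axis=1)

       # Stationary distribution of this reversible chain: pi_i ~ exp(scale * f_i).
       log_pi = scale * f
       pi = np.exp(log_pi - log_pi.max())
       pi /= pi.sum()

       # Symmetrize with sqrt(pi) so a symmetric eigensolver can be used.
       d = np.sqrt(pi)
       sym = d[:, None] * Q / d[None, :]
       sym = (sym + sym.T) / 2.0
       eigvals, eigvecs = np.linalg.eigh(sym)

       # Skip the zero eigenvalue (stationary state); keep the slowest modes.
       idx = np.argsort(eigvals)[::-1][1:n_dims + 1]
       rates = -eigvals[idx]
       right = eigvecs[:, idx] / d[:, None]  # right eigenvectors of Q
       return right / np.sqrt(rates), rates


   # Toy map: 4 sites, 2 alleles, additive fitness (16 genotypes).
   toy_genotypes = ["".join(g) for g in itertools.product("AB", repeat=4)]
   toy_fitness = [g.count("A") for g in toy_genotypes]
   coords, rates = toy_visualization_coordinates(toy_genotypes, toy_fitness)

Each row of ``coords`` gives the position of one genotype in the two-dimensional representation. The dense matrix used here is only practical for small maps; maps with millions of genotypes would need sparse matrices and iterative eigensolvers.
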
History and applications
========================

``gpmap-tools`` was initially designed as an internal library that lets new lab members and students analyze high-throughput combinatorial datasets without requiring advanced expertise in the original code. We have since used it in a number of collaborative studies to address a broad range of biological questions by understanding the structure of high-dimensional genotype-phenotype maps:

- How do mutations in the hydrophobic core of the GFP protein interact with each other, and how can they be leveraged to design proteins with new functionality?
  `Jonathan Yaacov Weinstein, Carlos Martí-Gómez, Rosalie Lipsh-Sokolik, Shlomo Yakir Hoch, Demian Liebermann, Reinat Nevo, Haim Weissman, Ekaterina Petrovich-Kopitman, David Margulies, Dmitry Ivankov, David M. McCandlish & Sarel J. Fleishman. Designed active-site library reveals thousands of functional GFP variants (2023) `_
- How can functional orthogonality arise in pairs of interacting proteins?
  `Ziv Avizemer, Carlos Martí‐Gómez, Shlomo Yakir Hoch, David M. McCandlish, Sarel J. Fleishman. Evolutionary paths that link orthogonal pairs of binding proteins (2023) `_
- How does the structure of the genetic code influence the navigability of complex protein genotype-phenotype maps?
  `Hana Rozhoňová, Carlos Martí-Gómez, David M McCandlish, Joshua L Payne. Robust genetic codes enhance protein evolvability (2024) `_

``gpmap-tools`` is now available to the broader scientific community and provides access to advanced computational techniques through just a few lines of Python code, while also extending its core functionalities. More details are available in the documentation and in our preprint, linked below.

Citation
========

- `Carlos Martí-Gómez, Juannan Zhou, Wei-Chia Chen, Justin B. Kinney, David M. McCandlish. Inference and visualization of complex genotype-phenotype maps with gpmap-tools (2025) `_

YouTube Talk
============

We presented some of this work at the `Variant Effects Seminar Series `_, and the 25-minute talk is publicly available on its YouTube channel `here `_.

.. youtube:: glQ0jllgdfY

.. toctree::
   :maxdepth: 2
   :caption: Table of Contents

   installation
   usage/getting_started.ipynb
   inference
   visualization
   evolution
   usage/7_datasets.ipynb
   api