gpmap-tools: tools for inference and visualization of complex genotype-phenotype maps

gpmap-tools was created to integrate and standardize methods developed in the McCandlish Lab for the inference, analysis, and visualization of complex genotype-phenotype maps comprising up to millions of genotypes.

For example, this is a low-dimensional representation of a 12-nucleotide genotype-phenotype map comprising roughly 16.8 million genotypes (4^12), where the fitness of each genotype depends on the 4-amino-acid sequence it encodes under the standard genetic code. This visualization illustrates how the functional sequences in this vast set are accessible from one another under a given evolutionary model, and it highlights the long, winding mutational paths required to evolve between sequences that differ at only a few amino acid positions.

_images/visualization.png
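
To make the size of this example concrete, the short sketch below counts the genotypes and translates a 12-nucleotide sequence into its 4-amino-acid product. It uses only the Python standard library; the names `CODON_TABLE` and `translate` are hypothetical helpers written for this illustration and are not part of gpmap-tools.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... and "*" marking stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a DNA sequence whose length is a multiple of 3 (hypothetical helper)."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


n_genotypes = len(BASES) ** 12   # all 12-nucleotide sequences
n_peptides = 20 ** 4             # all 4-amino-acid sequences without stops

print(n_genotypes, n_peptides)
print(translate("ATGGCCAAATGG"), translate("ATGGCTAAGTGG"))
```

Because many nucleotide sequences encode the same peptide, genotypes that share a phenotype can still be several mutations apart, which is one reason the paths in the figure are long.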

Inference

gpmap-tools implements a suite of Gaussian process models for the inference and analysis of complete genotype-phenotype maps.

These models share a unified interface inspired by the machine learning library scikit-learn: each model is defined as a class with the common methods fit and predict, which infer hyperparameters and make phenotypic predictions, respectively. Among other things, gpmap-tools can be used to:

  • Estimate the magnitude of genetic interactions of different orders from experimental measurements.

  • Infer complete combinatorial genotype-phenotype maps containing millions of sequences from experimental measurements and observations of natural sequences.

  • Predict the phenotypes of unobserved genotypes with associated uncertainty.

  • Estimate the effects of mutations in specific and possibly unobserved genetic backgrounds with associated uncertainty.

  • Compute the variance explained by interactions of different orders involving combinations of sites in a complete genotype-phenotype map.
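
The fit and predict pattern described above can be illustrated with a toy Gaussian process regressor over short sequences. This is a minimal sketch under stated assumptions, not the gpmap-tools API: the class `HammingGP`, its parameters `decay` and `noise_grid`, and the synthetic phenotype are all hypothetical, and the covariance (decaying geometrically with Hamming distance) is chosen only because it is simple.

```python
import itertools

import numpy as np


class HammingGP:
    """Toy Gaussian process regressor over fixed-length sequences.

    Hypothetical class for illustration only; not part of gpmap-tools.
    """

    def __init__(self, decay=0.5, noise_grid=(0.01, 0.1, 1.0)):
        self.decay = decay            # covariance shrinks by this factor per mismatch
        self.noise_grid = noise_grid  # candidate noise variances compared in fit

    def _kernel(self, seqs_a, seqs_b):
        a = np.array([list(s) for s in seqs_a])
        b = np.array([list(s) for s in seqs_b])
        hamming = (a[:, None, :] != b[None, :, :]).sum(axis=2)
        return self.decay ** hamming

    def fit(self, X, y):
        """Pick the noise variance that maximizes the marginal likelihood."""
        self.X_ = list(X)
        y = np.asarray(y, dtype=float)
        self.mean_ = y.mean()
        resid = y - self.mean_
        K = self._kernel(self.X_, self.X_)
        best = None
        for noise in self.noise_grid:
            C = K + noise * np.eye(len(y))
            chol = np.linalg.cholesky(C)
            alpha = np.linalg.solve(C, resid)
            # Log marginal likelihood up to an additive constant.
            lml = -0.5 * resid @ alpha - np.log(np.diag(chol)).sum()
            if best is None or lml > best[0]:
                best = (lml, noise, alpha, C)
        _, self.noise_, self.alpha_, self.C_ = best
        return self

    def predict(self, X_new):
        """Return posterior means and variances for the given sequences."""
        K_star = self._kernel(X_new, self.X_)
        mean = self.mean_ + K_star @ self.alpha_
        solved = np.linalg.solve(self.C_, K_star.T)
        # The prior variance of every genotype is decay ** 0 = 1.
        var = 1.0 - np.einsum("ij,ji->i", K_star, solved)
        return mean, var


# Synthetic example: all 8 genotypes of length 3 over two alleles,
# with 6 of them "measured" and 2 left unobserved.
genotypes = ["".join(g) for g in itertools.product("AB", repeat=3)]
phenotype = {g: float(g.count("B")) for g in genotypes}
train = genotypes[:6]

model = HammingGP().fit(train, [phenotype[g] for g in train])
mean, var = model.predict(genotypes)
```

The same two calls cover both hyperparameter inference and prediction with uncertainty, which is the design choice that the scikit-learn convention encourages.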

Visualization

gpmap-tools also provides a new, accessible implementation of a powerful, previously described method for visualizing complex genotype-phenotype maps.

More specifically, gpmap-tools can be used to:

  • Compute the coordinates of the low-dimensional representation of a genotype-phenotype map.

  • Write and read intermediate files in the efficient parquet and npz formats.

  • Plot the low-dimensional representation using different backends that offer varying degrees of computational speed and flexibility. For example, datashader can efficiently render visualizations of genotype-phenotype maps with millions of genotypes.

  • Identify the sequence features that characterize different regions of the genotype-phenotype map with additional plotting functionalities.
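
As a rough illustration of the first step in this list, the sketch below computes low-dimensional coordinates for a tiny map with 16 genotypes. It assumes, as one common formulation of such an evolutionary model, a weak-mutation random walk whose rate between single-mutant neighbors grows with the fitness difference, and it takes the coordinates from the slowest-decaying eigenvectors of the rate matrix. The functions `rate_matrix` and `coordinates`, the `scale` parameter, and the fitness values are hypothetical and do not reflect the gpmap-tools interface.

```python
import itertools
import math

import numpy as np


def rate_matrix(genotypes, fitness, scale=1.0):
    """Rate matrix of a weak-mutation random walk (hypothetical helper)."""
    n = len(genotypes)
    Q = np.zeros((n, n))
    for i, j in itertools.permutations(range(n), 2):
        # Only genotypes that differ at exactly one site are connected.
        if sum(a != b for a, b in zip(genotypes[i], genotypes[j])) == 1:
            s = scale * (fitness[j] - fitness[i])
            Q[i, j] = 1.0 if abs(s) < 1e-12 else s / (1.0 - math.exp(-s))
    # Diagonal entries make every row sum to zero.
    Q[np.diag_indices(n)] = -Q.sum(axis=1)
    return Q


def coordinates(Q, fitness, scale=1.0, n_axes=2):
    """Scaled right eigenvectors for the slowest-decaying modes of Q."""
    pi = np.exp(scale * np.asarray(fitness, dtype=float))
    pi /= pi.sum()                      # stationary distribution of the walk
    root = np.sqrt(pi)
    # The walk is reversible, so this similarity transform is symmetric.
    sym = root[:, None] * Q / root[None, :]
    sym = (sym + sym.T) / 2             # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Skip the zero eigenvalue and keep the next n_axes largest ones.
    order = np.argsort(eigvals)[::-1][1:n_axes + 1]
    right = eigvecs[:, order] / root[:, None]
    return right / np.sqrt(-eigvals[order]), eigvals[order]


# Tiny synthetic map: 4 sites, 2 alleles, with one pairwise interaction.
genotypes = ["".join(g) for g in itertools.product("AB", repeat=4)]
fitness = [g.count("B") + (1.0 if g.startswith("BB") else 0.0) for g in genotypes]

Q = rate_matrix(genotypes, fitness)
coords, eigvals = coordinates(Q, fitness)
```

Each row of `coords` gives the position of one genotype in the plane. A dense eigen-decomposition like this only works for small maps; maps with millions of genotypes need sparse matrices and iterative eigensolvers.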

History and applications

gpmap-tools was initially designed as an internal library that allowed new lab members and students to analyze high-throughput combinatorial datasets without requiring advanced expertise in the original code. We have since used it in a number of collaborative studies to address a broad range of biological questions by characterizing the structure of high-dimensional genotype-phenotype maps.

gpmap-tools is now available to the broader scientific community and provides access to advanced computational techniques through just a few lines of Python code. The public release also extends the library's core functionalities. For more details, see the documentation and our preprint linked below.

Citation

YouTube Talk

We presented some of this work at the Variant Effects Seminar Series, and the 25-minute talk is publicly available on its YouTube channel here.