gpmap-tools: tools for inference and visualization of complex genotype-phenotype maps
gpmap-tools was created to integrate and standardize methods developed in the
McCandlish Lab
for the inference, analysis, and visualization of complex genotype-phenotype maps
comprising up to millions of genotypes.
For example, this is a low-dimensional representation of a 12-nucleotide genotype-phenotype map comprising roughly 16 million genotypes, where fitness depends on the four-amino-acid sequence each genotype encodes under the standard genetic code. This visualization illustrates how the functional sequences in this vast set are accessible from one another under a given evolutionary model and highlights the long, winding mutational paths required to evolve sequences differing at only a few amino acid positions.
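The size of this example map follows from simple counting, which the sketch below works through. It is illustrative only and does not use gpmap-tools: the names `CODON_TABLE`, `translate`, and `n_encodings` are hypothetical helpers, and the only biological input is the standard genetic code.

```python
import itertools

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence whose length is a multiple of 3."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def n_encodings(peptide):
    """Count the nucleotide sequences that encode a given peptide."""
    total = 1
    for aa in peptide:
        total *= sum(1 for value in CODON_TABLE.values() if value == aa)
    return total


n_genotypes = 4 ** 12   # 16,777,216 nucleotide sequences of length 12
n_peptides = 20 ** 4    # 160,000 possible four-amino-acid sequences

print(n_genotypes, n_peptides)
print(translate("ATGTGGAAATAA"))                   # MWK*
print(n_encodings("MWMW"), n_encodings("LLLL"))   # 1 1296
```

Because many nucleotide sequences encode the same peptide, a single phenotype value is shared by anywhere from 1 to 1,296 genotypes in a map like this one.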
Inference
gpmap-tools implements a suite of Gaussian process models for the
inference and analysis of complete genotype-phenotype maps.
Carlos Martí-Gómez and David M. McCandlish. Inference of fitness landscapes with heterogenous patterns of epistasis across sites (2026). In preparation.
These models are implemented within a unified framework inspired by the classical
machine learning library scikit-learn, defined
as classes with the common methods fit and predict to infer hyperparameters and
make phenotypic predictions, respectively. Among other things, gpmap-tools allows
users to:
Estimate the magnitude of genetic interactions of different orders from experimental measurements.
Infer complete combinatorial genotype-phenotype maps containing millions of sequences from experimental measurements and observations of natural sequences.
Predict the phenotypes of unobserved genotypes with associated uncertainty.
Estimate the effects of mutations in specific and possibly unobserved genetic backgrounds with associated uncertainty.
Compute the variance explained by interactions of different orders involving combinations of sites in a complete genotype-phenotype map.
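To make the fit and predict pattern concrete, here is a minimal sketch of a scikit-learn-style Gaussian process regressor over sequences, written with NumPy only. It is not the gpmap-tools API: the class `HammingGPSketch`, its parameters, and the kernel (covariance decaying geometrically with Hamming distance) are assumptions chosen for illustration, and the hyperparameter search is a simple grid over marginal likelihoods.

```python
import itertools
import numpy as np


class HammingGPSketch:
    """Hypothetical GP regressor with a fit/predict interface (illustration only)."""

    def __init__(self, decay_grid=(0.1, 0.3, 0.5, 0.7, 0.9), noise_var=0.01):
        self.decay_grid = decay_grid
        self.noise_var = noise_var

    @staticmethod
    def _encode(seqs):
        # One row per sequence, one column per site.
        return np.array([list(s) for s in seqs])

    def _kernel(self, A, B, decay):
        # Covariance decays geometrically with Hamming distance.
        dist = (A[:, None, :] != B[None, :, :]).sum(axis=2)
        return self.signal_var_ * decay ** dist

    def _factorize(self, decay):
        K = self._kernel(self.X_, self.X_, decay)
        L = np.linalg.cholesky(K + self.noise_var * np.eye(len(self.X_)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, self.resid_))
        # Log marginal likelihood up to a constant shared by all decay values.
        log_ml = -0.5 * self.resid_ @ alpha - np.log(np.diag(L)).sum()
        return log_ml, L, alpha

    def fit(self, seqs, y):
        """Infer the decay hyperparameter by maximizing the marginal likelihood."""
        self.X_ = self._encode(seqs)
        y = np.asarray(y, dtype=float)
        self.mean_ = y.mean()
        self.resid_ = y - self.mean_
        self.signal_var_ = y.var()
        fits = {d: self._factorize(d) for d in self.decay_grid}
        self.decay_ = max(fits, key=lambda d: fits[d][0])
        _, self.L_, self.alpha_ = fits[self.decay_]
        return self

    def predict(self, seqs, return_var=False):
        """Posterior mean (and optionally variance) for arbitrary sequences."""
        K_star = self._kernel(self._encode(seqs), self.X_, self.decay_)
        mean = self.mean_ + K_star @ self.alpha_
        if not return_var:
            return mean
        v = np.linalg.solve(self.L_, K_star.T)
        var = np.maximum(self.signal_var_ - (v ** 2).sum(axis=0), 0.0)
        return mean, var


# Simulated additive landscape on 2 alleles and 6 sites (64 genotypes).
rng = np.random.default_rng(0)
all_seqs = ["".join(s) for s in itertools.product("AB", repeat=6)]
effects = rng.normal(size=6)
truth = np.array([sum(e for e, a in zip(effects, s) if a == "B") for s in all_seqs])
observed = rng.choice(len(all_seqs), size=40, replace=False)
y_obs = truth[observed] + rng.normal(scale=0.1, size=40)

model = HammingGPSketch().fit([all_seqs[i] for i in observed], y_obs)
pred_mean, pred_var = model.predict(all_seqs, return_var=True)
```

The point of the shared interface is that the same two calls apply regardless of the prior: fit learns hyperparameters from measured genotypes, and predict returns phenotypes with uncertainty for every genotype, including unobserved ones.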
Visualization
gpmap-tools also provides a new and accessible implementation of a
previously described powerful method to visualize complex genotype-phenotype maps.
More specifically, gpmap-tools allows users to:
Compute the coordinates of the low-dimensional representation of a genotype-phenotype map.
Write and read intermediate files in the efficient and fast parquet and npz formats.
Plot the low-dimensional representation efficiently using different backends with varying degrees of computational speed and flexibility. For example, datashader can render visualizations of genotype-phenotype maps with millions of genotypes.
Identify the sequence features that characterize different regions of the genotype-phenotype map with additional plotting functionalities.
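The text above does not spell out how the coordinates are computed, so the following is only a sketch under stated assumptions: it embeds genotypes using the slowest-decaying eigenvectors of a reversible rate matrix for a population evolving by single mutations, with each axis rescaled by 1/sqrt(-eigenvalue). The function names, the `scaled_ne` parameter, and the specific rate formula are illustrative choices, not the gpmap-tools implementation.

```python
import itertools
import numpy as np


def rate_matrix(genotypes, fitness, scaled_ne=1.0):
    """Rate matrix of a random walk between genotypes one mutation apart."""
    n = len(genotypes)
    Q = np.zeros((n, n))
    for i, j in itertools.permutations(range(n), 2):
        if sum(a != b for a, b in zip(genotypes[i], genotypes[j])) != 1:
            continue
        s = scaled_ne * (fitness[j] - fitness[i])
        # Fixation-like rate: 1 for neutral steps, higher uphill, lower downhill.
        Q[i, j] = 1.0 if np.isclose(s, 0.0) else s / (1.0 - np.exp(-s))
    Q[np.diag_indices(n)] = -Q.sum(axis=1)
    return Q


def visualization_coords(genotypes, fitness, scaled_ne=1.0, n_axes=2):
    """Coordinates from the slowest-decaying modes of the rate matrix."""
    fitness = np.asarray(fitness, dtype=float)
    Q = rate_matrix(genotypes, fitness, scaled_ne)
    # Stationary distribution of this reversible chain.
    log_pi = scaled_ne * fitness
    pi = np.exp(log_pi - log_pi.max())
    pi /= pi.sum()
    sqrt_pi = np.sqrt(pi)
    # Symmetrize so that a symmetric eigensolver can be used.
    A = sqrt_pi[:, None] * Q / sqrt_pi[None, :]
    eigvals, eigvecs = np.linalg.eigh((A + A.T) / 2)
    order = np.argsort(eigvals)[::-1]          # eigenvalue ~0 comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    right = eigvecs / sqrt_pi[:, None]         # right eigenvectors of Q
    keep = slice(1, n_axes + 1)                # skip the stationary mode
    coords = right[:, keep] / np.sqrt(-eigvals[keep])
    return coords, eigvals, Q


# Toy landscape: 4 biallelic sites with fitness peaks at AAAA and BBBB.
genotypes = ["".join(g) for g in itertools.product("AB", repeat=4)]
fitness = [abs(g.count("B") - 2) for g in genotypes]
coords, eigvals, Q = visualization_coords(genotypes, fitness, scaled_ne=2.0)
print(coords.shape)
```

This dense version is only practical for small maps. Maps with millions of genotypes would need sparse matrices and an iterative eigensolver.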
History and applications
gpmap-tools was initially designed as an internal library that lets new lab members and students
analyze high-throughput combinatorial datasets without requiring advanced expertise
in the original code. We have since used it in a number of collaborative studies
to address a broad range of biological questions by understanding the structure
of high-dimensional genotype-phenotype maps:
How do mutations in the hydrophobic core of the GFP protein interact with each other and how can they be leveraged to design proteins with new functionality?
How can functional orthogonality arise in pairs of interacting proteins?
How does the structure of the genetic code influence the navigability of complex protein genotype-phenotype maps?
gpmap-tools is now available to the broader scientific community and provides
access to advanced computational techniques through just a few lines of Python code,
while also extending its core functionalities. Find more details through the
documentation and in our preprint linked below.
Citation
YouTube Talk
We presented some of this work at the Variant Effects Seminar Series, and the 25-minute talk is publicly available on its YouTube channel here.