This week’s highlights span cancer evolution, computational frameworks, and next-generation functional genomics. We see DNA methylation emerge as both driver and historian of tumor evolution, from the cooperative epigenetic–genetic interplay uncovered by TRACERx to EVOFLUX’s use of stochastic methylation as a natural barcode. On the methods side, ParTIpy scales Pareto task inference to modern single-cell data, SPACE brings CRISPR screens into intact 3D tissue with spatial resolution, and a cautionary study shows why “routine” spATAC-seq normalization may erase biology. Meanwhile, generative AI proves it can design viable phages from scratch, and small cell lung cancer is caught literally wiring itself into brain circuits. Together, these studies highlight the creativity of both cancer and scientists, and the tools we’re building to keep up.
Preprints/articles that I managed to read this week
DNA methylation cooperates with genomic alterations during non-small cell lung cancer evolution
Gimeno-Valiente et al. Nature Genetics (2025). https://doi.org/10.1038/s41588-025-02307-x
The paper in one sentence
This study reveals how DNA methylation works hand-in-hand with genetic changes like copy number alterations to drive lung cancer evolution, uncovering new epigenetic drivers and a novel mechanism called "allosteric chromatin activity transition" (AllChAT) that helps cancer cells tolerate the stress of amplified oncogenes.
Summary
This research from the TRACERx lung cancer study uses an advanced method (CAMDAC) to accurately measure DNA methylation specifically in cancer cells from 217 tumor samples. The team developed two key metrics: one to measure methylation heterogeneity within a tumor (ITMD) and another (MR/MN) to identify genes where methylation is likely functional and under evolutionary selection. They found that epigenetic and genetic changes often work in parallel to silence tumor suppressor genes. Crucially, they discovered that when oncogenes are amplified, the surrounding chromatin can undergo a widespread change (AllChAT), leading to hypermethylation of nearby essential genes. This acts as a "dosage compensation" mechanism, buffering the cell from the harmful effects of over-expressing these essential genes and allowing the cancer to thrive. They also identified new candidate epigenetic driver genes, some of which are linked to worse patient outcomes.
Personal highlights
Deconvolving the tumor epigenome: the use of the CAMDAC tool to purify cancer-cell-specific methylation signals from bulk sequencing data, overcoming the confounding effects of tumor purity and copy number changes that have plagued previous studies.
The MR/MN ratio: an evolutionary metric for epigenetics: the development of a powerful new metric, analogous to the dN/dS ratio in genetics, which distinguishes genes under selection for functionally impactful (regulatory) hypermethylation from those with merely passenger events.
Epigenetic dosage compensation of essential genes: the discovery that DNA hypermethylation is used to silence essential genes co-amplified with potent oncogenes (like KRAS or CCND1), maintaining their expression at tolerable levels and revealing a novel, non-genetic survival strategy for cancer cells.
The Allosteric Chromatin Activity Transition (AllChAT) model: the proposal of an elegant mechanistic model where a copy number alteration at an oncogene locus can trigger a broad, cooperative change in chromatin state (like an allosteric effect in proteins), impacting methylation and expression of passenger genes across a large genomic region.
Early epigenetic events shape genomic trajectories: the finding that promoter hypermethylation of certain genes (like VIPR2, ZNF714) can occur early, even in pre-invasive lesions, and may predispose a tumor to subsequently acquire specific driver mutations (e.g., in STK11 or CDKN2A).
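To make the MR/MN idea concrete, here's a toy calculation of my own (not the paper's statistical model, which uses proper null distributions and normalization): like dN/dS, it compares the rate of presumably functional (regulatory) hypermethylation events against the rate of presumed passenger events elsewhere in the same gene. All counts below are hypothetical.

```python
# Toy MR/MN-style selection metric for promoter hypermethylation,
# analogous to dN/dS. Numbers are hypothetical; the paper's actual
# method models nulls and covariates far more carefully.

def mr_mn_ratio(regulatory_hits, nonregulatory_hits,
                regulatory_sites, nonregulatory_sites):
    """Ratio of the regulatory hypermethylation rate to the
    non-regulatory (presumed passenger) rate. Values >> 1 suggest
    selection for functional silencing."""
    mr = regulatory_hits / regulatory_sites
    mn = nonregulatory_hits / nonregulatory_sites
    return mr / mn

# Hypothetical gene: 12 hypermethylation events across 100 regulatory
# CpGs vs 3 events across 300 CpGs outside regulatory elements.
ratio = mr_mn_ratio(12, 3, 100, 300)
print(f"MR/MN = {ratio:.1f}")  # prints "MR/MN = 12.0"
```

A ratio well above 1 flags the gene as a candidate under selection for regulatory hypermethylation, rather than a passive recorder of genome-wide methylation drift.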
Why should we care?
This work moves beyond cataloging epigenetic changes and fundamentally advances our understanding of how DNA methylation actively cooperates with genetic alterations to fuel cancer. For cancer biologists, it provides a new lens through which to view tumor evolution, highlighting methylation not just as a passive marker but as an active player in managing the selective pressures of genomic instability. The AllChAT model offers a fresh, mechanistic framework for understanding how cancers tolerate massive genomic amplifications. For clinicians, the identified epigenetic drivers and the MR/MN metric open new avenues for patient stratification and could reveal novel therapeutic vulnerabilities; targeting the epigenetic "brakes" that cancer cells apply to essential genes, for instance, could synergize with existing therapies.
ParTIpy: A Scalable Framework for Archetypal Analysis and Pareto Task Inference
Schäfer et al. bioRxiv (2025). https://doi.org/10.1101/2025.09.08.674797
The paper in one sentence
ParTIpy is a scalable, open-source Python package that uses archetypal analysis to model biological trade-offs, revealing how cells specialize into distinct functional tasks by finding the extreme points (archetypes) in high-dimensional data like single-cell transcriptomics.
Summary
This paper introduces ParTIpy, a computational tool designed to overcome the limitations of its predecessor, the ParTI MATLAB package. ParTIpy implements the Pareto Task Inference (ParTI) framework, which is grounded in the theory that biological systems evolve toward Pareto optimality, where improving performance in one task (e.g., lipid metabolism) worsens performance in another (e.g., detoxification). The core algorithm, archetypal analysis, identifies these pure "task specialists" (archetypes) as the vertices of a polytope that encloses all cells in gene expression space, with each cell being a mixture of these archetypes. ParTIpy achieves scalability to modern large-scale datasets through algorithmic innovations, including efficient initialization strategies and the use of coresets (small, weighted data subsets), allowing analysis of 100,000+ cells using only 1-10% of the data. The package provides a full workflow: determining the optimal number of archetypes, characterizing them via pathway enrichment, and mapping them onto spatial data to uncover drivers of specialization like chemical gradients or cell-cell communication.
Personal highlights
Scalable archetypal analysis via coresets and optimized algorithms: implements state-of-the-art initialization (Archetypal++) and optimization algorithms, combined with a coreset-based approach that reduces runtime by ~4x on large datasets (>100k cells) by using only a tiny, representative fraction of the data without sacrificing result quality.
A principled framework for continuous biological trade-offs: moves beyond discrete clustering to model cells as existing on a continuum between extreme "specialist" states (archetypes), offering a more natural interpretation of the continuous variation inherent in cellular phenotypes and their functional allocations.
Integrated workflow for archetype characterization and biological interpretation: provides built-in tools not just for finding archetypes, but for making sense of them, including robust methods for selecting their number, pathway enrichment analysis via decoupler-py, spatial mapping, and inference of archetypal crosstalk through ligand-receptor analysis.
Constraint relaxation for robust archetype discovery: incorporates a convexity relaxation parameter (δ) that allows inferred archetypes to extend beyond the observed data cloud, preventing underestimation of the true phenotypic space and providing more biologically plausible specialists when sampling is limited.
Seamless integration into the Python single-cell ecosystem: designed as an open-source Python package that adopts standard data structures (AnnData), making it a plug-and-play tool that fits directly into established single-cell analysis workflows and broadens accessibility beyond proprietary software (MATLAB) limitations.
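The geometric core of Pareto task inference is easy to picture: cells live inside a polytope whose vertices are the archetypes, and each cell is a convex mixture of them. Here is a minimal numpy sketch of that idea (my own illustration of the geometry, not the ParTIpy API or its optimization algorithms):

```python
import numpy as np

# Minimal sketch of the ParTI geometry: cells as convex mixtures of a
# few "archetypes", the vertices of a polytope enclosing the data.
# This is NOT ParTIpy code; archetype positions here are known, whereas
# the real task is inferring them from data.

rng = np.random.default_rng(0)

# Three hypothetical archetypes (task specialists) in a 2D latent space.
archetypes = np.array([[0.0, 0.0],
                       [10.0, 0.0],
                       [0.0, 10.0]])

# Simulate cells as random convex combinations (Dirichlet weights).
weights = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=500)
cells = weights @ archetypes  # shape (500, 2)

def barycentric(point, verts):
    """Recover mixture weights of `point` w.r.t. triangle `verts` by
    solving the 3x3 system [verts^T; 1 1 1] w = [point; 1]."""
    A = np.vstack([verts.T, np.ones(len(verts))])
    b = np.append(point, 1.0)
    return np.linalg.solve(A, b)

w = barycentric(cells[0], archetypes)
assert np.allclose(w, weights[0])  # mixture weights recovered exactly
print("cell 0 task weights:", np.round(w, 3))
```

The recovered weights are directly interpretable as a cell's allocation across tasks, which is what makes this framing more informative than a hard cluster label for continuous phenotypes.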
Why should we care?
ParTIpy fills a critical gap in the analysis of high-dimensional biological data by addressing a fundamental biological principle: trade-offs. It provides a powerful, scalable, and interpretable framework to understand why cells vary continuously, not as noise or intermediate states, but as optimal allocations of limited resources to competing functions.
Fluctuating DNA methylation tracks cancer evolution at clinical scale
Gabbutt et al. Nature (2025). https://doi.org/10.1038/s41586-025-09374-4
The paper in one sentence
Researchers developed a new method, EVOFLUX, that uses the natural, random fluctuations in DNA methylation as an "evolving barcode" to reconstruct the entire evolutionary history of a patient's cancer from a single, standard DNA test.
Summary
Cancer evolves, and its evolutionary history is written in its DNA. However, reading this history has been expensive and technically challenging. This study introduces EVOFLUX, a powerful computational framework that deciphers this history by analyzing a specific set of 978 CpG sites where DNA methylation stochastically fluctuates over years, acting as a natural cellular barcode. By applying EVOFLUX to nearly 2,000 lymphoid cancer samples, the team quantified key evolutionary parameters, like initial growth rate, tumor age, and epimutation rates, across different cancer types. They found that most cancers grow in an "effectively neutral" manner with little subclonal selection, that evolutionary history is a strong independent prognostic factor, and that in aggressive transformations, the seed of the aggressive clone can exist decades before clinical presentation.
Personal highlights
Leveraging 'junk' epigenetics as an evolutionary clock: the method identifies CpG sites in silent genomic regions where methylation changes are neutral and clock-like, transforming them from noise into a high-resolution tool for lineage tracing and dating clonal expansions directly from bulk tissue.
A Bayesian engine for inferring deep cancer history: EVOFLUX combines a stochastic model of methylation fluctuation and population growth with sophisticated Bayesian inference to quantitatively estimate a tumor's birth date, initial growth rate, and effective population size from a single snapshot.
Most cancers evolve neutrally at the bulk sample level: the analysis reveals that strong subclonal selection is relatively rare within bulk samples, with the majority of cancers (1,610 of 1,976) showing no evidence of selective sweeps, suggesting neutral evolution dominates until later stages.
The seed of aggression is sown decades in advance: in patients whose chronic leukemia transformed into an aggressive cancer, phylogenetic tracing with EVOFLUX detected that the founding cell of the lethal clone diverged from the main tumor over a decade before the initial cancer diagnosis.
Evolutionary history is an independent prognostic biomarker: the initial growth rate inferred by EVOFLUX was a powerful predictor of clinical outcome in chronic lymphocytic leukemia, outperforming or adding independent value to established genetic markers like IGHV status and TP53 mutations.
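The intuition behind the fluctuating-CpG clock can be shown with a toy forward simulation (my own sketch; the real EVOFLUX couples a stochastic methylation model to population dynamics and full Bayesian inference, and real lineages share ancestry rather than evolving independently as they do here):

```python
import numpy as np

# Toy simulation of the "fluctuating CpG barcode" idea behind EVOFLUX.
# Illustrative only: flip rate, locus count, and division count are
# hypothetical, and lineages are treated as independent for simplicity.

rng = np.random.default_rng(42)

n_cpg = 200        # fluctuating CpG loci per cell
flip_rate = 0.02   # hypothetical epimutation probability per division
n_divisions = 30   # divisions since the founder cell
n_lineages = 256   # lineages sampled from the final population

# Every lineage starts from the same founder methylation state and
# accumulates independent stochastic flips (0 <-> 1) over divisions.
founder = rng.integers(0, 2, size=n_cpg)
lineages = np.tile(founder, (n_lineages, 1))
for _ in range(n_divisions):
    flips = rng.random(lineages.shape) < flip_rate
    lineages = np.where(flips, 1 - lineages, lineages)

# Bulk methylation fraction per CpG: with age, loci drift away from 0/1
# toward intermediate values. That drift is the clock EVOFLUX reads.
bulk_beta = lineages.mean(axis=0)
drifted = np.mean((bulk_beta > 0.05) & (bulk_beta < 0.95))
print(f"fraction of CpGs at intermediate methylation: {drifted:.2f}")
```

The older the clonal expansion, the further the bulk methylation fractions drift from their founder values of 0 or 1, which is what lets a single bulk snapshot date the tumor.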
Why should we care?
This work fundamentally changes how we can study cancer evolution. By turning widely available, low-cost DNA methylation data into a precise readout of a tumor's past, EVOFLUX makes large-scale evolutionary studies clinically feasible. For oncologists, it offers a new, independent prognostic factor that captures the inherent aggressiveness of a tumor's biology. For researchers, it provides a scalable tool to answer fundamental questions about how cancers initiate, progress, and respond to treatment across thousands of patients. Most profoundly, it suggests that the potential for highly aggressive disease may be present years or even decades before it manifests clinically, opening new avenues for early interception and prevention.
Generative AI Designs Functional Bacteriophages from Scratch
King et al. bioRxiv (2025). https://doi.org/10.1101/2025.09.12.675911
The paper in one sentence
Researchers used fine-tuned genome language models to generate entirely novel, functional bacteriophage genomes that successfully infected target bacteria, with some outperforming a natural benchmark phage and rapidly overcoming bacterial resistance.
Summary
This study presents the first successful generative design of complete, functional viral genomes. Using the well-characterized bacteriophage ΦX174 as a template, the team fine-tuned large language models (Evo 1 and Evo 2) on a dataset of Microviridae phage genomes. They developed a sophisticated computational pipeline to generate thousands of novel genome designs and filter them based on quality control, host specificity, and evolutionary novelty. Out of 285 synthesized designs, 16 were functionally viable, capable of infecting and lysing the target E. coli host. These AI-generated phages exhibited significant sequence novelty, with some displaying higher fitness or faster lysis than the natural ΦX174 phage. Crucially, a cocktail of these generated phages could rapidly overcome bacterial resistance in a way that ΦX174 alone could not, showcasing the potential of generative AI to create resilient therapeutic agents.
Personal highlights
Whole-genome design with controllable constraints: the authors developed a multi-tiered filtering system that enforces sequence quality, ensures specific host tropism (e.g., spike protein similarity ≥60%), and promotes evolutionary novelty (e.g., AAI <95%), moving far beyond simple sequence generation to intentional, steerable design.
Overcoming the overlapping gene annotation challenge: standard gene-finding tools failed to annotate all 11 genes in the ΦX174 genome due to extensive overlaps. The team built a custom "pseudo-circularization" and ORF-calling method to accurately predict genes in their generated sequences, a critical step for applying gene-level design rules.
Functional validation of AI-generated organisms: the study bridges the gap between in silico design and biological function. They experimentally rebooted phages from synthesized DNA, with 16 novel genomes demonstrating clear lytic activity, specific host range, and variable fitness, providing a robust framework for validating generative biology.
Unlocking non-viable evolutionary paths: the model designed a viable phage (Evo-Φ36) that incorporated a spike protein from a distantly related phage (G4), a swap that previous rational engineering attempts had found to be non-viable. This demonstrates the AI's ability to find context-dependent solutions that evade human intuition.
Cocktail resilience against evolved resistance: A mixture of the generated phages evolved the ability to infect ΦX174-resistant bacteria within 1-5 passages, while ΦX174 alone failed. Sequencing revealed that this was achieved through recombination and mutation events between the AI-generated phages, showcasing the power of designed diversity to combat adaptation.
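The filtering logic of the design pipeline is straightforward to sketch. The thresholds below (spike similarity ≥60%, AAI <95%) come from the paper's description, but the candidate records and score values are hypothetical placeholders, not real designs or computed similarities:

```python
# Illustrative sketch of the multi-tier filter applied to generated
# phage genomes: keep designs that pass QC, match the target host's
# spike protein, and remain evolutionarily novel. Candidate data are
# hypothetical.

def passes_filters(design):
    """Keep designs that are host-matched but evolutionarily novel."""
    return (design["qc_ok"]
            and design["spike_similarity"] >= 0.60   # host tropism
            and design["aai_to_known"] < 0.95)       # novelty (AAI)

candidates = [
    {"name": "gen-001", "qc_ok": True,  "spike_similarity": 0.82, "aai_to_known": 0.71},
    {"name": "gen-002", "qc_ok": True,  "spike_similarity": 0.41, "aai_to_known": 0.66},  # off-target spike
    {"name": "gen-003", "qc_ok": True,  "spike_similarity": 0.75, "aai_to_known": 0.97},  # near-copy of a known phage
    {"name": "gen-004", "qc_ok": False, "spike_similarity": 0.90, "aai_to_known": 0.80},  # fails QC
]

kept = [d["name"] for d in candidates if passes_filters(d)]
print(kept)  # prints "['gen-001']"
```

The point of the two opposing thresholds is worth noting: one pulls designs toward the known host-recognition machinery, the other pushes them away from merely regurgitating training genomes.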
Why should we care?
This work is a landmark demonstration that AI can not only learn the complex language of biology but can also write entirely new functional genomic "paragraphs"—in this case, complete viral genomes. For the broader audience, it signals a future where we can computationally design living systems to address pressing challenges. For phage therapy, it suggests a path to rapidly develop bespoke, resilient cocktails that can outmaneuver antibiotic-resistant bacteria. More fundamentally, it provides a powerful new tool for basic science, allowing researchers to systematically explore the vast landscape of possible genomic sequences to answer questions about evolution, gene function, and the rules of life itself. It transforms genome design from a painstaking process of editing what exists into a generative process of creating what could exist.
Neuronal activity-dependent mechanisms of small cell lung cancer pathogenesis
Savchuk et al. Nature (2025). https://doi.org/10.1038/s41586-025-09492-z
The paper in one sentence
Small cell lung cancer (SCLC) cells in the brain hijack the nervous system by forming functional synapses with neurons, using both glutamate and GABA signals to depolarize their membranes and fuel their own growth, while also making the local brain circuitry more excitable.
Summary
This groundbreaking study reveals that the lethal spread of small cell lung cancer (SCLC) to the brain is actively driven by the nervous system itself. The researchers show that SCLC cells don't just grow near neurons; they integrate into brain circuits by forming direct, functional synapses with them. Through these "neuron-to-cancer" synapses, SCLC cells receive excitatory (glutamatergic) and depolarizing (GABAergic) signals that cause calcium influx and membrane depolarization, which is sufficient to trigger tumor cell proliferation and invasion. This relationship is bidirectional: the cancer cells also increase neuronal excitability in their vicinity. Crucially, the study demonstrates that cutting the vagus nerve dramatically reduces primary lung tumor growth in mice, and that an existing anti-seizure drug (levetiracetam) can reduce the growth of SCLC brain metastases by disrupting this harmful neuron-cancer communication.
Personal highlights
SCLC forms bona fide synapses with neurons in the brain: using electron microscopy and electrophysiology, the team provides direct structural and functional evidence that SCLC cells become post-synaptic partners to neurons, receiving both glutamatergic and GABAergic input.
GABA is depolarizing and growth-promoting in SCLC: contrary to its typical inhibitory role in the adult brain, GABAergic signaling depolarizes SCLC cells due to their high intracellular chloride concentration, and this depolarization is a potent driver of tumor proliferation.
Activity-dependent transcriptional reprogramming: co-culturing SCLC cells with neurons activates specific gene programs related to synapses and proliferation. This reprogramming is blocked by tetrodotoxin (TTX), proving it is dependent on neuronal electrical activity, not just secreted factors.
Direct membrane depolarization is sufficient for growth: using optogenetics to artificially depolarize SCLC cell membranes (independent of neurons) was enough to double tumor size, establishing depolarization as a key oncogenic signal.
Therapeutically targetable with repurposed drugs: the anti-epileptic drug levetiracetam, which inhibits synaptic vesicle release, significantly reduced the proliferation and burden of SCLC brain tumors in mice, offering a near-term translational path.
Why should we care?
This study reveals that some cancers can literally "plug in" to the brain's circuitry to steal growth signals. For patients, this opens up a promising new therapeutic avenue: repurposing existing, well-tolerated neurological drugs to "unplug" the tumor from its power source. For scientists and clinicians, it establishes a new paradigm for understanding brain metastases, suggesting that targeting the neuro-cancer interface could be a critical strategy across multiple cancer types. It’s a powerful reminder that to defeat a clever enemy like cancer, we must understand the entire microenvironment it corrupts for its own benefit.
Library Size in Spatial ATAC-seq: Technical Confounder or Biology?
Ji, K.X., & Ji, H. bioRxiv (2025). https://doi.org/10.1101/2025.09.15.676443
The paper in one sentence
Standard normalization of spatial ATAC-seq data, which treats total read count (library size) as a technical artifact, inadvertently removes biologically meaningful signal that reflects tissue structure, impairs correlation with gene expression, and can reverse conclusions in differential analysis.
Summary
This pivotal study challenges a fundamental assumption in the analysis of spatial epigenomics data. The authors demonstrate that in spatial ATAC-seq (spATAC-seq), the total number of reads per spot (library size) is not just technical noise but is strongly correlated with underlying tissue biology. By analyzing five diverse datasets, they show that library size patterns mirror tissue anatomy (e.g., higher in the granule cell layer of the hippocampus) and that unnormalized data correlates better with matched spatial RNA-seq data than normalized data does. Crucially, using standard library size normalization or the state-of-the-art TF-IDF method severely degraded the performance of spatial domain detection and, in many cases, completely reversed the biological conclusions of differential analysis. The work serves as a major cautionary note for the field, arguing that blindly normalizing library size can strip away vital biological information and calling for the development of new methods that can disentangle true biological signal from technical artifact.
Personal highlights
Library size is a biological signal, not just noise: the spatial map of raw read counts in spATAC-seq strongly correlates with histological tissue structure across multiple organs and technologies, indicating it captures fundamental biological variation like cellular density or global chromatin openness.
Normalization weakens cross-modality correlation: unnormalized spATAC-seq data showed a stronger correlation with matched spatial RNA-seq data than library-size-normalized or TF-IDF-transformed data, suggesting standard practices remove a shared biological signal present in both modalities.
Spatial domain detection is harmed by normalization: in 4 out of 5 datasets, clustering raw spATAC-seq data identified spatial domains that better matched histological regions than clustering normalized data. In one melanoma case, normalization completely failed to distinguish two biologically distinct tumor compartments.
Normalization can reverse biological conclusions: Differential analysis of a known marker gene (PROX1) in the hippocampus showed upregulation in the correct region with raw data, but this signal disappeared or reversed after normalization. Across the genome, a vast majority (90%) of significant hits showed discordant directions between raw and normalized analyses.
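The sign-flip danger is easy to reproduce with simulated counts. In this toy of mine (all numbers hypothetical, not taken from the paper's datasets), the more accessible region has both higher counts at a marker peak and a larger library size, so dividing by total reads inverts the comparison:

```python
import numpy as np

# Toy demonstration of the paper's warning: when library size itself
# tracks biology, per-spot library-size normalization can flip the
# direction of a differential result. All counts are simulated.

rng = np.random.default_rng(1)
n = 500  # spots per region

# Region B is genuinely more accessible: higher marker-peak counts AND
# higher total reads (library size correlates with global openness).
marker_a = rng.poisson(5.0, n);  total_a = rng.poisson(10_000, n)
marker_b = rng.poisson(15.0, n); total_b = rng.poisson(40_000, n)

raw_diff = marker_b.mean() - marker_a.mean()

# Standard library-size normalization: counts per 10,000 reads.
norm_a = marker_a / total_a * 10_000
norm_b = marker_b / total_b * 10_000
norm_diff = norm_b.mean() - norm_a.mean()

print(f"raw:        B - A = {raw_diff:+.2f}")   # positive: B higher
print(f"normalized: B - A = {norm_diff:+.2f}")  # negative: flipped
assert raw_diff > 0 and norm_diff < 0
```

Because B's marker signal (3x A's) grows more slowly than its library size (4x A's), normalization makes the truly more-accessible region look less accessible, exactly the discordance the authors report genome-wide.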
Why should we care?
This work forces a paradigm shift in how we handle a basic—but critical—step in analyzing spatial epigenomics data. It reveals that a routine "quality control" procedure (library size normalization) is, in fact, a major source of error, potentially leading to false biological discoveries and missed signals. For anyone using or developing tools for spATAC-seq, this paper is an essential warning: your default pipeline is likely broken. It moves the field from a simplistic "normalize-and-forget" mindset to a more nuanced understanding that library size is a complex mixture of technical and biological effects.
SPACE: spatially resolved multiomic analysis for high-throughput CRISPR screening in 3D models
Hu et al. bioRxiv (2025). https://doi.org/10.1101/2025.09.14.675819
The paper in one sentence
SPACE is a transformative, imaging-based method that enables high-throughput, whole-transcriptome CRISPR screening within intact 3D tissue models like spheroids, preserving spatial context to directly visualize how genetic perturbations alter cell states, ligand-receptor interactions, and tumor microenvironments.
Summary
This paper introduces SPACE (SPAtial Cell Exploration), a groundbreaking platform that merges large-scale CRISPR screening with high-plex spatial multiomics. Traditional methods like Perturb-seq lose all spatial information by dissociating cells. SPACE overcomes this by using highly multiplexed in situ imaging (CosMx) to detect guide RNAs, the whole transcriptome (~18,000 genes), and up to 68 proteins on the same slide from intact 3D models like spheroids. The authors demonstrate SPACE's power by performing a high-throughput screen on hundreds of cancer-associated fibroblast (CAF)-tumor spheroids. They show it can accurately decode CRISPR identities, robustly profile transcriptomes, and uncover novel biology—such as how knocking out ISG20 in CAFs dampens matrix metalloproteinase (MMP) activity and how RNF213 knockout strengthens ECM-tumor interactions to drive proliferation. Crucially, SPACE's preservation of spatial architecture allows for the direct observation of how a cell's local microenvironment and density modulate the effect of a genetic perturbation, enabling unbiased discovery of spatially variable genes and ligand-receptor interactions that are invisible to dissociative methods.
Personal highlights
Unprecedented multiomic integration on a single slide: SPACE simultaneously detects CRISPR gRNAs, the whole transcriptome (~18,000 genes), and up to 68 proteins within intact 3D tissue models, representing the highest-plex multimodal CRISPR screen achieved to date.
Spatially informed ligand-receptor analysis: by leveraging exact cell positions, SPACE moves beyond statistical inference to directly identify which ligand-receptor pairs are physically enriched at cell-cell interfaces after a perturbation, dramatically increasing confidence in discovered interactions (e.g., increased collagen-CD44 binding after ISG20 KO).
Decoding microenvironmental modulation of perturbations: SPACE reveals that the phenotypic consequence of a genetic knock-out is context-dependent; it showed that MMP pathway activity in control CAFs is regulated by local tumor cell density, an effect that was erased by ISG20 knockout.
Cost-effective, high-throughput spatial transcriptomics: the method drastically reduces the cost of whole-transcriptome profiling at single-cell resolution in 3D models, enabling the screening of hundreds of spheroids and nearly 70,000 cells in a scalable manner that would be prohibitively expensive with sequencing-based approaches.
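The kind of spatially informed ligand-receptor test this data enables can be sketched in a few lines of numpy. This is my own toy version of the general idea (score ligand+/receptor+ co-occurrence across physically adjacent cell pairs against a label-permutation null), not the authors' pipeline, and all positions and labels are simulated:

```python
import numpy as np

# Toy spatially informed ligand-receptor test: are ligand+ cells and
# receptor+ cells physically adjacent more often than chance? Simulated
# data; not the SPACE pipeline.

rng = np.random.default_rng(7)
n = 400
xy = rng.uniform(0, 100, size=(n, 2))  # cell centroids (um)

# Make ligand+ and receptor+ cells spatially co-localized in one corner.
corner = (xy[:, 0] < 30) & (xy[:, 1] < 30)
ligand = corner & (rng.random(n) < 0.8)
receptor = corner & (rng.random(n) < 0.8)

# Neighbor pairs: centroids closer than 10 um (excluding self-pairs).
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
neighbors = (dist < 10) & (dist > 0)

def interface_score(lig, rec, nbr):
    """Fraction of neighbor pairs where a ligand+ cell faces a receptor+ cell."""
    return (nbr & np.outer(lig, rec)).sum() / nbr.sum()

observed = interface_score(ligand, receptor, neighbors)

# Null: permute labels across cells while keeping the spatial graph fixed.
null = [interface_score(rng.permutation(ligand),
                        rng.permutation(receptor), neighbors)
        for _ in range(200)]
print(f"observed {observed:.3f} vs null mean {np.mean(null):.3f}")
assert observed > np.mean(null)  # co-localized LR pair is interface-enriched
```

Because the pairing is computed over actual cell-cell adjacencies rather than bulk co-expression, a hit here means the interaction is physically plausible, which is the confidence boost the spatial readout buys.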
Why should we care?
SPACE shatters the long-standing trade-off between scale and context in functional genomics. It provides a direct answer to the critical question of where a genetic perturbation exerts its effect, not just what the effect is. By preserving the native tissue architecture, it allows researchers to directly visualize how knocking out a gene in one cell type (like a fibroblast) rewires the entire local microenvironment and influences neighboring cells (like tumors) through physical contact and secreted signals. For drug discovery, it offers a powerful, high-throughput platform to identify novel targets and understand compound mechanisms in the most physiologically relevant 3D models. SPACE effectively provides a high-resolution, functional map of disease biology, moving us from inferring networks to directly observing them in action.
Other papers that piqued my interest and were added to the purgatory of my “to read” pile
Combinatorial prediction of therapeutic perturbations using causally inspired neural networks
Unravelling the genetics and epigenetics of the ageing tumour microenvironment in cancer
ROSIE: AI generation of multiplex immunofluorescence staining from histopathology images
Unveiling causal regulatory mechanisms through cell-state parallax
Functional synapses between neurons and small cell lung cancer
Cancer subclone detection based on DNA copy number in single-cell and spatial omic sequencing data
Spatial gene expression at single-cell resolution from histology using deep learning with GHIST
Human-scATAC-Corpus: a comprehensive database of scATAC-seq data
Tumor-myeloid crosstalk drives therapy resistance in localized bladder cancer
Basal cell of origin resolves neuroendocrine–tuft lineage plasticity in cancer
Learning the cellular origins across cancers using single-cell chromatin landscapes
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
ArchVelo: Archetypal Velocity Modeling for Single-cell Multi-omic Trajectories
DynPerturb: Dynamic Perturbation Modeling for Spatiotemporal Single-Cell Systems
Systematic benchmarking of computational methods to identify spatially variable genes
Thanks for reading.
Cheers,
Seb.