Weekly reads 14/7/25

From spatial screens to sequence grammar: innovations driving single-cell and multi-omic discovery

Jul 20, 2025

This week’s selection of papers reflects the rapidly expanding toolbox for decoding gene regulation, tissue architecture, and treatment resistance—across modalities, scales, and even organisms. From Perturb-Multi, a spatially resolved CRISPR screening platform, to GLM-Prior’s genomic language modeling for regulatory networks, the papers span innovations in single-cell multi-modal integration, enhancer design, and microbiome-immunotherapy crosstalk. Together, they highlight a recurring theme: turning complexity—whether in sequence, structure, or cellular neighborhoods—into clarity through better models, measurements, and mechanistic insights.

Preprints/articles that I managed to read this week

Perturb-Multimodal: A Platform for Pooled Genetic Screens with Imaging and Sequencing in Intact Mammalian Tissue

Saunders et al., Cell (2025). DOI: 10.1016/j.cell.2025.05.022

The paper in one sentence

Perturb-Multi combines CRISPR-based genetic perturbations, single-cell RNA sequencing, and multiplexed imaging to map genotype-phenotype relationships in intact tissues with subcellular resolution.

Summary

The study introduces Perturb-Multi, a groundbreaking platform that pairs pooled CRISPR screens with multimodal phenotyping (scRNA-seq + spatial imaging) in intact mouse liver tissue. By fixing and analyzing the same tissue with both sequencing (transcriptomes) and imaging (protein/mRNA localization, morphology), it links genetic perturbations to diverse cellular states, spatial organization, and dynamic responses to metabolic stress. Key applications include dissecting hepatocyte zonation, stress pathways, and lipid droplet regulation, revealing convergent and divergent mechanisms underlying steatosis.

Personal highlights

Multimodal fusion of Perturb-seq and spatial imaging: Perturb-Multi uniquely integrates single-cell transcriptomics with RCA-MERFISH (for RNA/protein imaging) and deep-learning-based morphology analysis, enabling simultaneous measurement of gene expression, subcellular features, and tissue architecture in CRISPR-perturbed cells.
Fixed-cell Perturb-seq for in vivo screens: overcomes challenges of live-cell isolation by optimizing fixed-tissue dissociation and sgRNA detection, preserving transcriptomes while enabling scalable, spatially resolved CRISPR screens in native tissue contexts.
Rolling-circle amplified barcoding (RCA-MERFISH): innovates probe design and enzymatic amplification to image short perturbation barcodes alongside endogenous RNAs, achieving >100× efficiency gains and robust detection in perfusion-fixed tissues.
Dynamic phenotype discovery across scales: reveals how genetic perturbations alter hepatocyte zonation (e.g., Hs6st1 knockout mimics Wnt pathway effects), ER stress responses (e.g., Self1 knockout downregulates secretory mRNAs), and steatosis via distinct transcriptional vs. lipid-sequestration mechanisms.
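
To make the "perturbation to phenotype" link concrete, here is a minimal sketch of the most basic readout in a pooled screen: average expression per guide, compared against non-targeting control cells as a log2 fold change. This is not the paper's pipeline. The guide labels, gene names, pseudocount, and toy values are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from math import log2


def mean_expression_by_guide(cells):
    """cells: iterable of (guide_label, {gene: normalized_expression}).

    Genes missing from a cell's dict are treated as zero expression.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for guide, expr in cells:
        counts[guide] += 1
        for gene, value in expr.items():
            sums[guide][gene] += value
    return {
        guide: {gene: total / counts[guide] for gene, total in genes.items()}
        for guide, genes in sums.items()
    }


def log2_fold_change(means, guide, control="non_targeting", pseudocount=1.0):
    """Per-gene log2 fold change of one guide versus the control group."""
    ctrl, pert = means[control], means[guide]
    genes = set(ctrl) | set(pert)
    return {
        gene: log2((pert.get(gene, 0.0) + pseudocount)
                   / (ctrl.get(gene, 0.0) + pseudocount))
        for gene in genes
    }


# Toy data (hypothetical): two control cells, two cells carrying "guide_A".
toy_cells = [
    ("non_targeting", {"gene_1": 3.0, "gene_2": 1.0}),
    ("non_targeting", {"gene_1": 3.0, "gene_2": 1.0}),
    ("guide_A", {"gene_1": 1.0, "gene_2": 1.0}),
    ("guide_A", {"gene_1": 1.0, "gene_2": 1.0}),
]
toy_means = mean_expression_by_guide(toy_cells)
toy_lfc = log2_fold_change(toy_means, "guide_A")
print(sorted(toy_lfc.items()))
```

A real analysis would add per-cell normalization, guide-assignment QC, and a proper statistical test; the point here is only the shape of the comparison.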

Why should we care?

Perturb-Multi bridges a critical gap in functional genomics: not just which genes affect cell states, but how they reshape tissue organization and physiology in vivo. By coupling CRISPR screens with spatial multi-omics, it transforms our ability to:

Decipher disease mechanisms (e.g., steatosis drivers in metabolic liver disease) with cellular and subcellular resolution.
Uncover context-dependent gene function, like diet-specific stress responses or zonation regulators, beyond what cell cultures reveal.
Empower machine learning in biology by providing ground-truth datasets linking genetic perturbations to multidimensional phenotypes.

For biologists, it’s a toolkit to dissect tissue complexity; for computational scientists, a gold standard for validating spatial models; and for translational researchers, a path to identifying therapeutic targets with spatial precision.

GLM-Prior: A Nucleotide Transformer Model Reveals Prior Knowledge as the Key to Gene Regulatory Network Inference

Gibbs, C. S., Chen, A., Bonneau, R., & Cho, K. (2025). GLM-Prior: A nucleotide transformer model reveals prior knowledge as the driver of GRN inference performance. bioRxiv. https://doi.org/10.1101/2025.06.29.662198

The paper in one sentence

GLM-Prior, a genomic language model, leverages DNA sequence data to construct high-quality prior knowledge for gene regulatory network (GRN) inference, outperforming traditional methods and revealing that expression data primarily refines—rather than discovers—regulatory structure.

Summary

The study introduces GLM-Prior, a transformer-based model fine-tuned to predict transcription factor (TF)-gene interactions directly from nucleotide sequences. Integrated with the probabilistic matrix factorization model PMF-GRN, GLM-Prior forms a dual-stage pipeline that decouples prior-knowledge construction from expression-based GRN inference. Experiments across yeast, mouse, and human demonstrate that GLM-Prior captures most regulatory information directly from sequence, with expression data adding only marginal value. The model also generalizes well between closely related species (e.g., human-to-mouse transfer) but struggles with distant species like yeast, highlighting the importance of evolutionary conservation in regulatory logic.

Personal highlights

Sequence-driven prior knowledge outperforms curated databases and motif-based methods: GLM-Prior achieves higher accuracy (AUPRC) than YEASTRACT and motif-based priors in yeast, demonstrating that DNA sequence alone encodes most regulatory signals.
Expression data refines—rather than rebuilds—regulatory networks: when paired with high-quality priors, GRN inference primarily prunes unsupported edges, suggesting expression acts as a context-specific filter rather than a source of novel discoveries.
Cross-species transfer without retraining: a human-trained GLM-Prior model outperforms a mouse-specific model in predicting mouse regulatory interactions, showcasing conserved regulatory logic between evolutionarily close species.
Limitations in distant species generalization: the model fails to transfer effectively to yeast, underscoring the divergence in regulatory grammar and TF binding motifs across deep evolutionary distances.
Uncertainty-aware GRN inference: PMF-GRN provides well-calibrated posterior variance estimates, enabling interpretable confidence scores for predicted edges and highlighting high-confidence regulatory interactions.
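
Since AUPRC is the headline metric here, a small sketch of how it can be computed for a ranked list of predicted TF-gene edges may help. This uses average precision, a common step-wise estimate of the area under the precision-recall curve. I am assuming binary gold-standard labels; the scores and labels below are toy values, not results from the paper.

```python
def average_precision(scores, labels):
    """Average precision for predicted edges.

    scores: predicted confidence per candidate TF-gene edge.
    labels: 1 if the edge is in the gold standard, else 0.
    Ties are broken by input order (Python's sort is stable).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("need at least one gold-standard edge")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_positive


# Toy example (hypothetical): four candidate edges, two of them true.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(average_precision(toy_scores, toy_labels))
```

Because true edges are rare in GRN gold standards, AUPRC is usually more informative than AUROC: a random ranking scores roughly the fraction of positives rather than 0.5.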

Why should we care?

GLM-Prior shifts the paradigm of GRN inference by showing that prior knowledge construction—not algorithm complexity—is the bottleneck in accurate network reconstruction. For biologists, this means:

High-quality priors reduce reliance on noisy expression data, offering more robust baselines for studying gene regulation.
Cross-species transfer enables GRN inference in less-studied organisms, leveraging well-annotated genomes (e.g., human) to predict interactions in related species (e.g., mouse).

For computational researchers, the work highlights:

The diminishing returns of expression-centric inference when priors are strong, urging a focus on sequence-driven scaffolding.
The promise of foundation models in genomics, as transformer architectures capture long-range dependencies and regulatory grammar better than motif-based approaches.

By treating expression data as a modulator of a sequence-derived scaffold—rather than the primary signal—GLM-Prior opens new avenues for interpretable, generalizable, and biologically grounded GRN modeling.

Iterative Deep Learning Design of Human Enhancers Exploits Condensed Sequence Grammar for Cell-Type Specificity

Yin et al., Cell Systems (2025). DOI: 10.1016/j.cels.2025.101302

The paper in one sentence

Deep learning models iteratively design synthetic enhancers with unprecedented cell-type specificity by compressing transcription factor binding motifs into a minimal, high-impact sequence grammar.

Summary

Yin et al. combine iterative deep learning and experimental validation to create synthetic enhancers that outperform natural sequences in targeting gene expression to specific human cell lines (HepG2 and K562). Starting from chromatin accessibility or MPRA data, their models progressively refine enhancer designs through two rounds of training, uncovering a "condensed grammar" of motifs that drive specificity. Key innovations include motif density optimization, iterative small-data retraining, and single-cell validation linking enhancer activity to transcription factor expression. The resulting enhancers achieve up to 46× higher target-cell expression while being as short as 50 bp, offering practical tools for gene therapy and synthetic biology.

Personal highlights

Iterative small-data design breaks scalability barriers: by retraining models on just ~1,000 high-quality synthetic enhancers (30× smaller than initial datasets), the authors achieve dramatic specificity improvements—proving that targeted, low-throughput experiments can outperform brute-force screening.
Condensed motif grammar outperforms natural enhancers: synthetic designs pack transcription factor binding sites (TFBS) at 3× higher density than natural sequences, using a selective "vocabulary" of motifs (e.g., TP53 for HepG2, GATA1::TAL1 for K562) to maximize specificity while minimizing length.
Single-cell MPRA validates causal TF-enhancer links: scMPRA reveals that enhancer activity correlates with cognate transcription factor expression at single-cell resolution, confirming the biological relevance of model-designed sequences.
Motif interactions decoded via ablation experiments: perturbation studies uncover additive, redundant, and cooperative motif interactions, providing a blueprint for rational enhancer engineering (e.g., TP53 position-dependence, GATA1::TAL1/NFE2 cooperativity).
Short enhancers retain function: designs as compact as 50 bp maintain specificity—critical for gene therapy applications like AAV vectors, where payload size is limited.
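
The "3× higher density" claim is easier to picture with a toy calculation of motif density, i.e. binding-site matches per 100 bp. The sketch below uses exact string matching on both strands, which is a deliberate simplification: the authors work with motif models, not fixed strings, and the sequence and motif used here are hypothetical illustrations.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sites(seq, motif):
    """Count start positions where the motif matches on either strand.

    Palindromic motifs are counted once per position, not twice.
    """
    seq, motif = seq.upper(), motif.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in targets)


def motif_density(seq, motifs, per_bp=100):
    """Total motif matches per `per_bp` base pairs of sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    total = sum(count_sites(seq, motif) for motif in motifs)
    return total * per_bp / len(seq)


# Toy 20 bp sequence (hypothetical) with one forward and one reverse match.
toy_enhancer = "AGATAACCCCTTATCTGGGG"
toy_motifs = ["AGATAA"]  # toy consensus string, an assumption for illustration
print(count_sites(toy_enhancer, "AGATAA"), motif_density(toy_enhancer, toy_motifs))
```

Comparing this number between a designed sequence and a length-matched natural enhancer is the kind of contrast the density claim refers to.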

Why should we care?

This work bridges the gap between computational biology and real-world applications. For gene therapy, it offers a roadmap to design compact, highly specific enhancers that minimize off-target effects. Synthetic biologists gain a framework to engineer regulatory elements without relying on noisy genomic screens, while developmental biologists see how motif grammar evolves toward specificity. For machine learning, it demonstrates how iterative "small data" fine-tuning can outperform large-scale training. By linking model predictions to single-cell TF activity, the authors also provide a template for validating in silico designs in situ—a leap toward trustworthy AI-driven bioengineering.

Microbiota-Driven Antitumor Immunity via Dendritic Cell Migration: A New Bacterial Strain Enhances Cancer Immunotherapy

Lin et al. (2025). Nature. DOI: 10.1038/s41586-025-09249-8

The paper in one sentence

A novel gut bacterium, Hominenteromicrobium YB328, boosts antitumor immunity by activating dendritic cells to migrate to tumors and prime CD8+ T cells, enhancing the efficacy of PD-1 checkpoint blockade therapy.

Summary

The study identifies YB328, a strain of Hominenteromicrobium enriched in patients responsive to PD-1 immunotherapy, which stimulates CD103+CD11b− dendritic cells (cDCs) in the gut to migrate to tumors, activate tumor-specific T cells, and improve immunotherapy outcomes across multiple cancer types.

Personal highlights

Discovery of a keystone immunotherapy-enhancing bacterium: YB328, a previously undescribed strain from the Ruminococcaceae family, is enriched in the gut microbiota of PD-1 therapy responders and can be supplemented to reprogram non-responders’ microbiomes.
Dendritic cell "maturation-to-migration" axis: YB328 triggers cDCs via TLR7/9 and mTOR/STAT3 signaling, inducing their migration to tumors—where they sustain PD-1+CD8+ T cell activation against diverse tumor antigens.
Overcoming bacterial competition: YB328’s therapeutic effects are abolished by co-administration of Bacteroidaceae (enriched in non-responders), highlighting the importance of microbial ecology in treatment design.
Mechanistic rigor meets clinical relevance: the study links YB328 abundance in patients to cDC infiltration and prolonged survival, validated across melanoma, lung, gastric, and head/neck cancers.
Beyond "bug-to-drug" simplicity: YB328 doesn’t directly kill tumors but optimizes immune recognition—lowering the activation threshold for T cells to target immunodominant and subdominant tumor antigens.

Why should we care?

This work transforms our understanding of microbiome-cancer interactions from correlation to mechanism, revealing how specific bacteria orchestrate immune responses at a distance. For patients, YB328 could become a predictive biomarker or live therapeutic to overcome immunotherapy resistance. For researchers, it provides a blueprint for dissecting microbiome-immune crosstalk, emphasizing dendritic cells as pivotal messengers. Clinically, it opens avenues for precision microbiome modulation—whether via fecal transplants, bacterial consortia, or synthetic TLR agonists—to amplify immunotherapy’s reach.

Exercise-Induced Gut Microbiome Changes Boost Cancer Immunotherapy by Enhancing CD8 T Cell Function

Phelps et al., Cell (2025). DOI: 10.1016/j.cell.2025.06.018

The paper in one sentence

Exercise promotes gut microbiome production of formate, which activates CD8 T cells via the Nrf2 pathway to enhance antitumor immunity and improve immune checkpoint inhibitor efficacy in melanoma.

Summary

This study reveals a novel mechanism by which exercise improves cancer immunotherapy outcomes: it reshapes the gut microbiome to increase production of formate, a metabolite that boosts CD8 T cell function. Using preclinical melanoma models, the authors show that exercise-induced formate enhances tumor antigen-specific CD8 T cell responses and synergizes with anti-PD-L1 therapy. The gut microbiome is essential for this effect, as antibiotics or germ-free conditions abolish it. Formate works by activating the Nrf2 pathway in CD8 T cells, driving their proliferation and effector function. Importantly, human melanoma patients with high gut microbiome formate production respond better to immunotherapy, suggesting formate as a potential biomarker and therapeutic target.

Personal highlights

Microbiome as exercise’s middleman: exercise remodels gut microbiota to enrich bacteria producing formate, a metabolite that directly enhances CD8 T cell antitumor activity—linking physical activity to immune function via microbial metabolism.
Formate as a T cell turbocharger: microbiota-derived formate activates the Nrf2 pathway in CD8 T cells, amplifying their proliferation, cytokine production, and tumor-killing capacity—independent of other immune cells or checkpoint blockade.
Gut-to-tumor axis: exercise-induced formate travels from the gut to systemic circulation and tumors, correlating with reduced tumor growth and elevated intratumoral CD8 T cell responses—a clear metabolite-mediated immune boost.
Human relevance confirmed: high formate-producing gut microbiomes in melanoma patients associate with better immunotherapy response, and transplanting high-formate human microbiota into mice recapitulates the antitumor effects seen with exercise.
Beyond melanoma: formate restrains growth in multiple cancer models (melanoma, lymphoma, adenocarcinoma) and combats lung metastases, suggesting broad applicability for improving T cell-based therapies.

Why should we care?

This work transforms our understanding of how lifestyle choices like exercise can "reprogram" the immune system to fight cancer—via the gut microbiome. For patients, it suggests that combining exercise with immunotherapy could amplify treatment efficacy, while formate or its microbial producers might serve as next-generation adjuvants. For researchers, it unveils formate-Nrf2 signaling as a druggable axis to enhance CD8 T cell function, offering a roadmap to harness microbial metabolites for cancer therapy. Beyond oncology, the findings hint at microbiome-metabolite-immune crosstalk as a universal mechanism that could be tapped for infections, autoimmunity, or aging.

Bonus takeaway: The study also implies that sedentary lifestyles might impair immune surveillance not just through inactivity, but by depriving the microbiome of cues to produce metabolites like formate—a compelling reason to keep moving.

Harnessing Spatial Statistics for Spatial Omics Data with pasta

Emons, M., Gunz, S., Crowell, H. L., Mallona, I., Kuehl, M., Furrer, R., & Robinson, M. D. (2025). Harnessing the Potential of Spatial Statistics for Spatial Omics Data with pasta. [Journal Name].

The paper in one sentence

The pasta framework bridges spatial statistics and spatial omics, offering versatile tools to analyze point patterns and lattice data for quantifying biological phenomena like cellular co-localization and gene co-expression in tissues.

Summary

The paper introduces pasta, a computational framework that applies spatial statistics to spatial omics data, which preserve the spatial context of molecular measurements. It distinguishes between two data modalities, point patterns (e.g., single-cell imaging) and lattice data (e.g., spot-based sequencing), and provides tailored methods for each. Key applications include identifying hormone receptor-positive regions in breast cancer (lattice data) and quantifying tumor invasion patterns (point patterns). The work emphasizes the importance of scale, homogeneity, and neighborhood definitions in spatial analysis and includes an accessible vignette with R/Python code.

Personal highlights

Dual-modality framework for spatial omics: pasta elegantly distinguishes between point patterns (stochastic, event-based) and lattice data (fixed, observation-based), enabling tailored analyses for imaging- and sequencing-based technologies.
Spatial autocorrelation for localized gene expression: uses Moran’s I and Geary’s C to identify regions of high/low gene expression coherence, revealing clinically relevant hormone receptor patterns in breast cancer.
Quantifying cellular co-localization with point patterns: applies Besag’s L function to measure tumor invasion dynamics, distinguishing between clustered, regularly spaced, or randomly distributed cell types.
Flexible neighborhood definitions: demonstrates how weight matrix construction (contiguity vs. distance-based) impacts interpretation, e.g., ligand-receptor interactions versus paracrine signaling.
Open-source vignettes for reproducibility: provides pasta’s R/Python workflows, integrating with popular ecosystems (Scanpy, SpatialExperiment) to lower barriers for adoption.
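
The two families of statistics named above can be sketched in a few lines of plain Python. This is not the pasta code: it is a bare-bones illustration of the textbook definitions of Moran's I (lattice data) and Besag's L (point patterns), with no edge correction and no significance testing. The chain-shaped neighborhood, the observation window, and all toy coordinates are assumptions of mine.

```python
from math import hypot, pi, sqrt


def morans_i(values, weights):
    """Global Moran's I.

    values: one measurement per spot.
    weights: {(i, j): w_ij} spatial weights between spots i and j.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(weights.values())
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / w_total) * cross / sum(d * d for d in dev)


def chain_weights(n):
    """Contiguity weights for n spots arranged in a line (toy neighborhood)."""
    weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights


def besag_l(points, r, area):
    """Besag's L(r) = sqrt(K(r) / pi), with Ripley's K and no edge correction.

    Under complete spatial randomness L(r) is close to r, so L(r) > r
    suggests clustering and L(r) < r suggests regular spacing.
    """
    n = len(points)
    close_pairs = sum(
        1
        for i, (xi, yi) in enumerate(points)
        for j, (xj, yj) in enumerate(points)
        if i != j and hypot(xi - xj, yi - yj) <= r
    )
    k = area * close_pairs / (n * (n - 1))
    return sqrt(k / pi)


# Lattice toy data: similar neighbors give positive I, alternating give negative.
print(morans_i([1, 1, 0, 0], chain_weights(4)))
print(morans_i([1, 0, 1, 0], chain_weights(4)))

# Point-pattern toy data in a hypothetical 10 x 10 window.
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
regular = [(2.5, 2.5), (7.5, 2.5), (2.5, 7.5), (7.5, 7.5)]
print(besag_l(clustered, r=1.0, area=100.0), besag_l(regular, r=1.0, area=100.0))
```

Note how much rides on choices the statistic does not make for you: the weight matrix for Moran's I and the radius r for the L function. That is the same scale and neighborhood sensitivity the paper stresses.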

Why should we care?

pasta equips researchers with robust statistical tools to decode the spatial organization of tissues, answering not just where molecules are located, but how their arrangement drives biology. For biologists, it offers quantifiable insights into disease microenvironments (e.g., tumor heterogeneity) or developmental patterning. For computational scientists, it bridges classical spatial statistics with modern omics, emphasizing interpretability and scalability.

Benchmarking Single-Cell Multi-Modal Data Integrations

Fu, S., Wang, S., Si, D., Li, G., Gao, Y., & Liu, Q. (2025). Benchmarking single-cell multi-modal data integrations. Nature Methods. https://doi.org/10.1038/s41592-025-02737-9

The paper in one sentence

A comprehensive benchmark of 40 single-cell multi-modal integration algorithms evaluates usability, accuracy, and robustness across diverse datasets, providing actionable guidance for method selection in genomics research.

Summary

This study systematically compares 65 integration methods (from 40 algorithms) for single-cell multi-modal data, spanning DNA, RNA, protein, and spatial omics. The authors assess performance across paired, unpaired, and mosaic datasets, focusing on usability (e.g., scalability, documentation), accuracy (e.g., biological conservation, batch-effect removal), and robustness (e.g., data sparsity, dataset size). Key findings identify state-of-the-art tools like Seurat v4 WNN (paired RNA-ATAC), GLUE (unpaired diagonal), and MIDAS (cross-modality imputation), while highlighting trade-offs between computational demands and performance. The work includes a user-friendly platform for accessing benchmark results and a reproducible pipeline for future evaluations.

Personal highlights

Systematic evaluation of 65 integration methods: the study benchmarks tools across six integration tasks (e.g., paired RNA-ATAC, mosaic RNA-ADT), providing the most comprehensive comparison to date, with standardized metrics for usability, accuracy, and robustness.
State-of-the-art performers identified: Seurat v4 WNN excels in paired RNA-ATAC integration; GLUE dominates unpaired diagonal tasks; and MIDAS achieves near-paired-imputation accuracy in cross-modality predictions, despite using unpaired data.
Robustness to real-world challenges: methods are tested under varying data sparsity (e.g., 10% sequencing depth) and dataset sizes (up to 500k cells), revealing tools like MultiVI and scMVP as resilient to noise and scalable for large datasets.
Practical guidance for researchers: the benchmark highlights trade-offs—e.g., deep-learning methods (uniPort) offer high accuracy but require GPUs, while graph-based tools (Seurat) balance performance and accessibility.
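
For the sparsity stress tests, a common way to simulate reduced sequencing depth is binomial thinning: each observed count is kept with a fixed probability. I am not claiming this is the exact procedure the authors used; it is a generic sketch, and the function name, seed, and toy count matrix are hypothetical.

```python
import random


def downsample_counts(counts, fraction, seed=0):
    """Binomially thin a cells x features count matrix.

    Each individual count (read/UMI) is kept independently with
    probability `fraction`, so the expected depth scales by `fraction`.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        [sum(1 for _ in range(c) if rng.random() < fraction) for c in row]
        for row in counts
    ]


# Toy matrix (hypothetical): 2 cells x 3 features, thinned to ~10% depth.
toy_counts = [[1000, 0, 500], [20, 5, 0]]
toy_thinned = downsample_counts(toy_counts, fraction=0.1)
print(toy_thinned)
```

Running an integration method on the thinned matrix and on the original, then comparing metrics, gives a simple robustness curve as the fraction is lowered.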

Why should we care?

This benchmark empowers researchers to navigate the "jungle" of single-cell multi-omics tools with confidence. For computational biologists, it clarifies which methods excel at specific tasks (e.g., spatial integration vs. imputation) and why—saving months of trial-and-error. For wet-lab scientists, it translates technical evaluations into actionable recommendations, ensuring reliable integration of complex datasets.

Other papers that piqued my interest and were added to the purgatory of my “to read” pile

Thanks for reading.

Cheers,

Seb.

Sebcentrism
