Weekly reads 22/12/25

From spatial QC to generative tissue models

Dec 29, 2025

After an unplanned two-week pause, this roundup is back with a particularly method-heavy set of papers. For the next couple of weeks, I’ll also skip the usual “Papers added to my readlist” section to catch up on some busy work.
This week’s reads collectively aim to push spatial and cancer genomics beyond descriptive mapping toward context-aware inference, simulation, and generative modeling. From SpotSweeper-py, which brings neighborhood-aware quality control to the Python spatial omics ecosystem, to SIID, which jointly imputes missing genes and deconvolves cell mixtures across spatial platforms, these methods rethink how we preprocess and integrate spatial data. SpatialProp and Novae extend this trajectory by modeling how signals propagate through tissue and by learning panel-invariant spatial domains at foundation-model scale, respectively. Complementing these advances, ecDNAInspector adds much-needed confidence stratification to ecDNA structural predictions in cancer, while TissueNarrator reframes spatial transcriptomics itself as a language modeling problem, enabling generative simulations and natural-language interrogation of tissue architecture.

Preprints/articles that I managed to read this week

SpotSweeper-py: spatially-aware quality control metrics for spatial omics data in the Python ecosystem

Chen et al. bioRxiv (2025). https://doi.org/10.64898/2025.12.06.692760

The paper in one sentence

Researchers present SpotSweeper-py, a Python port of the R package SpotSweeper that implements neighborhood-aware quality control for spatial omics data, significantly reducing the over-filtering of biologically meaningful regions common to global QC methods.

Summary

Traditional quality control (QC) for spatial transcriptomics data, borrowed from single-cell RNA-seq workflows, applies global thresholds (e.g., median ± 3 MADs) across an entire tissue section. This often leads to the aggressive removal of large, contiguous regions that may be biologically relevant but have systematically lower gene counts—like tissue boundaries or gradients—while missing small, localized technical artifacts. SpotSweeper-py solves this by making spatially-aware QC accessible in Python. For each spot (or bin), it calculates a robust *z*-score by comparing its QC metric (e.g., log total counts, detected genes, mitochondrial percentage) to the median and MAD of its *k*-nearest spatial neighbors. This local approach effectively flags micro-tears and other small-scale technical failures while preserving large low-signal regions that are likely real tissue features. The package integrates seamlessly with the scverse ecosystem (AnnData, Scanpy) and is demonstrated to be effective and computationally efficient on both standard 10x Visium and high-resolution Visium HD data, where it dramatically reduces false-positive outlier rates compared to global QC.
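To make the local scoring idea concrete, here is a minimal pure-Python sketch of a neighborhood robust z-score. It is not the SpotSweeper-py API: the function name `local_robust_z`, the toy 3x3 grid, the choice to exclude the spot itself from its neighborhood, and the 1.4826 MAD scaling constant are all my own assumptions for illustration.

```python
import math
import statistics


def local_robust_z(coords, values, k=4):
    """Robust z-score of each spot relative to its k nearest spatial neighbors.

    Hypothetical helper for illustration, not part of SpotSweeper-py.
    """
    scores = []
    for i, (xi, yi) in enumerate(coords):
        # Rank every other spot by Euclidean distance to spot i.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbors = [values[j] for _, j in dists[:k]]
        med = statistics.median(neighbors)
        mad = statistics.median(abs(v - med) for v in neighbors)
        scaled_mad = 1.4826 * mad  # assumed consistency constant for normal data
        # Fall back to 0.0 when the neighborhood has no spread at all.
        scores.append((values[i] - med) / scaled_mad if scaled_mad > 0 else 0.0)
    return scores


# Toy 3x3 grid of log total counts with one locally depleted spot in the center.
coords = [(x, y) for y in range(3) for x in range(3)]
values = [10.0, 10.5, 9.5, 10.2, 2.0, 9.8, 10.1, 9.9, 10.3]

scores = local_robust_z(coords, values, k=4)
flagged = [i for i, z in enumerate(scores) if z < -3]  # low-side outliers only
```

On this toy grid, the center spot gets a strongly negative score because it is compared only with its immediate neighbors. In a real analysis the neighbor search would use a spatial index rather than a full sort per spot.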

Personal highlights

Bringing spatially-aware QC to the Python ecosystem: SpotSweeper-py successfully ports the proven local outlier detection logic of the original R/Bioconductor SpotSweeper package into a native Python tool, bridging a critical gap for the rapidly growing scverse (Scanpy, Squidpy, SpatialData) user base.
Dramatic reduction in over-filtering of biologically meaningful tissue: applied to a Visium breast cancer dataset, global QC using log total counts flagged 5.49% of spots as low-quality, while SpotSweeper-py flagged only 1.13%. The spots over-flagged by global QC formed large, contiguous regions that the local method correctly preserved.
Effective detection of focal technical artifacts: conversely, for metrics like mitochondrial percentage, SpotSweeper-py identified small, coherent clusters of outliers (0.38% of spots) where mitochondrial content spiked locally, indicative of micro-degradation, that were missed by broader global thresholds.

Joint imputation and deconvolution of gene expression across spatial transcriptomics platforms

Zheng et al. Genome Research (2025). https://genome.cshlp.org/content/35/12/2734

The paper in one sentence

SIID is a computational method that integrates paired high-resolution targeted and low-resolution whole-transcriptome spatial transcriptomics datasets to simultaneously impute missing genes and deconvolve cell type mixtures, all while preserving spatial information.

Summary

SIID (Spatial Integration for Imputation and Deconvolution) addresses a key challenge in spatial transcriptomics: different platforms trade off spatial resolution and gene coverage. Using a joint nonnegative matrix factorization model constrained by spatial alignment, SIID reconstructs a latent gene expression matrix from paired datasets, such as 10x Xenium (high resolution, targeted genes) and 10x Visium (lower resolution, whole transcriptome). This allows imputation of unmeasured genes in the high-resolution data and deconvolution of cell type proportions in the low-resolution spots. SIID outperforms existing methods in simulations and real cancer datasets, enabling finer-grained cell typing and better characterization of tissue microenvironments.

Personal highlights

Spatially-informed joint factorization: SIID uses a shared nonnegative matrix factorization model that leverages spatial alignment between two spatially resolved transcriptomics (SRT) platforms, enabling simultaneous imputation and deconvolution while preserving tissue architecture.
Dual-mode integration from a single model: unlike previous tools that treat SRT data as nonspatial or require scRNA-seq references, SIID jointly estimates latent cell-type expression and spot assignments, handling both missing genes in targeted panels and mixed signals in lower-resolution data.
Platform-aware modeling with scaling factors: the method accounts for platform-specific expression differences through gene-wise scaling factors, improving cross-technology integration without assuming identical count distributions.
Entropy regularization for crisp cell-type assignments: an entropy penalty encourages each high-resolution spot to map primarily to one latent factor (cell type), sharpening deconvolution results and improving interpretability.
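The entropy penalty is the easiest piece to illustrate in isolation. The sketch below computes the Shannon entropy of one spot's factor weights and shows why penalizing it favors crisp assignments. The function name `assignment_entropy` and the example weights are hypothetical, and this is not SIID's actual objective, which combines such a term with the factorization loss.

```python
import math


def assignment_entropy(weights):
    """Shannon entropy (in nats) of a spot's normalized factor weights.

    Hypothetical helper for illustration: lower entropy means a crisper assignment.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0)


crisp = assignment_entropy([1.0, 0.0, 0.0])    # spot maps to a single factor
mixed = assignment_entropy([0.6, 0.3, 0.1])    # partially mixed spot
uniform = assignment_entropy([1.0, 1.0, 1.0])  # maximally ambiguous spot
```

A regularizer that adds this quantity to the loss pushes each high-resolution spot toward the `crisp` case, while the `uniform` case, with entropy log(3) for three factors, is penalized the most.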

SpatialProp: tissue perturbation modeling with spatially resolved single-cell transcriptomics

Sun et al. bioRxiv (2025). https://doi.org/10.64898/2025.11.30.691355

The paper in one sentence

SpatialProp is a graph neural network-based framework that predicts how single-cell gene perturbations propagate through intact tissue, enabling in silico simulation of spatially resolved cellular responses.

Summary

SpatialProp addresses a critical gap in functional genomics: while single-cell perturbation tools like Perturb-seq measure cell-intrinsic effects, they largely ignore how perturbations affect neighboring cells within intact tissue. Using spatially resolved single-cell transcriptomics data, SpatialProp trains a graph neural network to predict a cell’s gene expression from its local microenvironment. This model can then simulate the tissue-wide impact of user-defined perturbations, propagating changes cell-by-cell. The authors also introduce CausalInteractionBench, a novel benchmark built from Gene Ontology terms to evaluate whether predicted perturbation effects are causally enriched for known biological interactions. Applied to multiple brain and heart datasets, SpatialProp accurately predicts gene expression from microenvironment context, captures fine-grained intra-cell-type heterogeneity, and shows enrichment for directional cell–cell signaling pathways.

Personal highlights

Microenvironment-aware perturbation propagation: SpatialProp uses a simple yet powerful GNN trained to predict a cell’s expression solely from its neighbors, enabling perturbation effects to diffuse realistically through tissue architecture without relying on cell type as a proxy.
Fine-grained steering of tissue microenvironments: through a novel “steering” experiment, SpatialProp demonstrates it can model how subtle changes in a cell’s local neighborhood shift its transcriptional state, even within the same cell type, validating its sensitivity to spatial context.
Causal benchmarking with biological ground truth: the introduction of CausalInteractionBench provides a principled way to assess whether perturbation predictions are causally meaningful, using curated sender–receiver gene sets from Gene Ontology to test directional signaling enrichment.
Sparse, calibrated updates guard against overprediction: a post-processing step called SparseRenorm ensures predicted perturbations are sparse and conserve total expression per cell, making outputs biologically plausible and robust to model error.
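As a rough intuition for what a sparsify-then-renormalize step could look like, here is a minimal sketch. I have not checked it against the authors' SparseRenorm implementation, so the function name `sparsify_and_renormalize`, the top-k selection rule, and the rescaling are my own assumptions rather than the paper's procedure.

```python
def sparsify_and_renormalize(original, predicted, top_k=2):
    """Keep only the top_k largest predicted changes, then restore the cell's total.

    Hypothetical sketch, not the SparseRenorm procedure from the paper.
    """
    deltas = [p - o for o, p in zip(original, predicted)]
    # Indices of the genes with the largest absolute predicted change.
    keep = set(
        sorted(range(len(deltas)), key=lambda g: abs(deltas[g]), reverse=True)[:top_k]
    )
    # Accept the prediction for kept genes (clamped at zero), revert the rest.
    updated = [
        max(predicted[g], 0.0) if g in keep else original[g]
        for g in range(len(original))
    ]
    total_before, total_after = sum(original), sum(updated)
    if total_after == 0:
        return updated
    # Rescale so the cell's total expression matches its pre-perturbation total.
    scale = total_before / total_after
    return [u * scale for u in updated]


original = [5.0, 3.0, 2.0, 0.0]
predicted = [5.1, 6.0, 0.5, 0.05]  # small changes on genes 0 and 3 are likely noise
result = sparsify_and_renormalize(original, predicted, top_k=2)
```

In this toy cell, only the two largest changes survive, the near-zero drift on the other genes is discarded, and the per-cell total is preserved.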

Novae: a graph-based foundation model for spatial transcriptomics data

Blampey et al. Nature Methods (2025). https://doi.org/10.1038/s41592-025-02899-6

The paper in one sentence

Novae is a graph neural network foundation model trained on ~30 million cells across 18 tissues to perform zero-shot, panel-invariant spatial domain identification with built-in batch-effect correction and hierarchical organization.

Summary

Novae tackles a central challenge in spatial transcriptomics: integrating and comparing data across multiple slides, technologies, and gene panels. Built on a self-supervised graph attention network and trained on a massive dataset of nearly 30 million cells from three imaging-based platforms (Xenium, MERSCOPE, CosMx), Novae learns a shared latent representation of cellular microenvironments. Unlike existing methods, it requires no gene panel intersection, performs native batch-effect correction via optimal transport, and identifies hierarchical spatial domains in zero-shot fashion, without retraining. Benchmarks across breast, colon, and synthetic datasets show superior performance in domain continuity and cross-slide homogeneity, while applications in lymph node architecture, Alzheimer’s mouse models, and head-and-neck cancer reveal its ability to uncover biologically meaningful spatial patterns and reorganizations.

Personal highlights

Zero-shot, panel-invariant spatial domain inference: Novae can be applied directly to new slides, even with unseen gene panels, without retraining, using pre-learned prototypes that represent elementary spatial domains across tissues and technologies.
Native batch-effect correction via optimal transport: instead of relying on external tools like Harmony, Novae uses a SwAV-inspired optimal transport objective to align prototypes across slides, preventing over-correction while preserving biologically distinct domains.
Hierarchical, prototype-based domain organization: spatial domains are derived from a set of learned prototypes, enabling efficient, multi-resolution clustering, from fine-grained niches to tissue-level regions, without recomputing embeddings.
Multimodal extension to histopathology and proteomics: Novae can fuse transcriptomic graphs with H&E patch embeddings (e.g., from CONCH) or protein expression profiles, enhancing domain segmentation with morphological or protein context.
Scalability and speed independent of external clustering: by avoiding Leiden/Harmony for inference, Novae assigns domains in seconds even on millions of cells, and its subgraph-based training prevents over-smoothing and scales to whole-slide atlases.
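The prototype idea explains why inference can be fast: once prototypes are learned, assigning a cell to a domain reduces to a similarity lookup. Below is a minimal sketch of nearest-prototype assignment by cosine similarity. It does not use the Novae package, and the function names and toy 2D embeddings are hypothetical. The real model works on learned microenvironment embeddings and groups prototypes hierarchically.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def assign_to_prototypes(embeddings, prototypes):
    """Index of the most similar prototype for each embedding.

    Hypothetical helper for illustration, not the Novae API.
    """
    return [
        max(range(len(prototypes)), key=lambda p: cosine(emb, prototypes[p]))
        for emb in embeddings
    ]


prototypes = [[1.0, 0.0], [0.0, 1.0]]              # two toy "elementary domains"
embeddings = [[0.9, 0.1], [0.2, 0.8], [2.0, 0.1]]  # toy microenvironment embeddings
domains = assign_to_prototypes(embeddings, prototypes)
```

Because no clustering is rerun at inference time, the cost grows linearly with the number of cells, which fits the speed claims above.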

High-confidence structural predictions of extrachromosomal DNA with ecDNAInspector

Pribus et al. bioRxiv (2025). https://doi.org/10.64898/2025.12.01.691649

The paper in one sentence

ecDNAInspector is a new computational framework that systematically assesses the confidence of ecDNA structural predictions from short-read sequencing data, enabling higher-quality, biologically relevant insights into cancer genomics.

Summary

This paper introduces ecDNAInspector, a tool designed to evaluate and filter predictions of extrachromosomal DNA (ecDNA) structures generated by existing inference tools like AmpliconArchitect. By integrating orthogonal structural variant calls and quality metrics such as mapping errors and breakpoint support, ecDNAInspector clusters ecDNA predictions into high, medium, and low confidence groups. Applied to a cohort of 231 breast cancers, the tool successfully identified 250 high-confidence ecDNA cycles, validated through Hi-C data and subtype-matched cell lines. The study reveals that ecDNA structural conservation is driven by oncogene inclusion and subtype-specific patterns, offering new insights into ecDNA formation and function in cancer progression.

Personal highlights

Systematic confidence assessment for ecDNA predictions: ecDNAInspector integrates orthogonal structural variant calls and quality flags, such as Mapping Error Boolean (MEB) and Extreme Cycle Size Boolean (ESB), to evaluate and cluster ecDNA cycles into high, medium, and low confidence groups, reducing false positives and enhancing reliability.
Validation through multi-modal data integration: high-confidence cycles identified by ecDNAInspector were validated using Hi-C contact maps, oncogene enrichment analysis, and experimental confirmation in breast cancer cell lines, demonstrating strong concordance between computational predictions and biological reality.
Subtype-specific structural conservation driven by oncogenes: the tool revealed that ecDNA structural conservation is predominantly driven by oncogene inclusion rather than recurrent breakpoints, with distinct co-amplification patterns observed across molecular subtypes such as IC5 and IC6.
Flexible and interpretable filtering framework: ecDNAInspector offers both unsupervised clustering and an optional weighted scoring system for confidence assignment, allowing users to prioritize specific metrics and adapt the tool to cohort-specific characteristics.
Scalable analysis of large cancer genomics cohorts: designed to work with existing short-read WGS data from cohorts like TCGA and ICGC, ecDNAInspector enables high-throughput, reproducible ecDNA structural analysis without requiring expensive long-read sequencing.
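To show how a weighted scoring scheme over quality flags might work, here is a small sketch. The field names (`sv_support`, `meb`, `esb`), the weights, and the tier thresholds are all hypothetical choices of mine, not ecDNAInspector's actual inputs or defaults.

```python
def confidence_score(cycle, weights):
    """Weighted score in [0, 1]; a raised boolean flag contributes zero.

    Hypothetical scoring scheme for illustration, not ecDNAInspector's.
    """
    components = {
        "sv_support": cycle["sv_support"],    # fraction of breakpoints with orthogonal SV support
        "meb": 0.0 if cycle["meb"] else 1.0,  # mapping error flag raised -> penalize
        "esb": 0.0 if cycle["esb"] else 1.0,  # extreme cycle size flag raised -> penalize
    }
    total_weight = sum(weights.values())
    return sum(weights[name] * value for name, value in components.items()) / total_weight


def confidence_tier(score, high=0.7, low=0.4):
    """Map a score to a tier using assumed thresholds."""
    if score >= high:
        return "high"
    return "medium" if score >= low else "low"


weights = {"sv_support": 5, "meb": 3, "esb": 2}  # user-chosen priorities (assumed)
cycles = [
    {"sv_support": 1.0, "meb": False, "esb": False},
    {"sv_support": 0.5, "meb": False, "esb": True},
    {"sv_support": 0.0, "meb": True, "esb": True},
]
tiers = [confidence_tier(confidence_score(c, weights)) for c in cycles]
```

Changing the weights is where cohort-specific priorities would come in, for example up-weighting orthogonal SV support when mapping quality is uniformly good.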

TISSUENARRATOR: Generative Modeling of Spatial Transcriptomics with Large Language Models

Liu et al. bioRxiv (2025). https://doi.org/10.1101/2025.11.24.690325

The paper in one sentence

TissueNarrator is a novel framework that reformulates spatial transcriptomics as a language modeling problem, enabling large language models to generate realistic cellular profiles, simulate perturbations, and answer natural-language queries about tissue organization.

Summary

TissueNarrator introduces a generative framework that bridges spatial biology and natural language processing by representing tissue sections as “spatial sentences”: ranked gene lists augmented with spatial coordinates and metadata. By fine-tuning large language models (LLMs) like Qwen-4B on these representations, TissueNarrator can generate context-aware cellular profiles, simulate neighborhood and genetic perturbations, and answer natural-language questions about tissue structure. Evaluated across multiple spatial transcriptomics technologies (MERFISH, Perturb-FISH, CosMx SMI), the model accurately recovers cell–cell interaction patterns, predicts immune infiltration programs in ovarian cancer, and enables interactive exploration of tissue architecture. The framework demonstrates that LLMs can effectively interpret spatial coordinates and serialized cell arrangements, offering a scalable, knowledge-driven approach to generative spatial biology.
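For intuition, a cell could be serialized into a spatial sentence roughly like this. The exact token format used by the paper is not reproduced here: the `x=`/`y=`/`type=` layout, the function name `cell_to_sentence`, and the tie-breaking rule are assumptions made for illustration.

```python
def cell_to_sentence(cell, top_n=3):
    """Serialize one cell as coordinates, metadata, and a ranked gene list.

    Hypothetical format for illustration, not TissueNarrator's actual tokenization.
    """
    # Rank expressed genes by descending count, breaking ties alphabetically.
    ranked = [
        gene
        for gene, count in sorted(
            cell["expression"].items(), key=lambda kv: (-kv[1], kv[0])
        )
        if count > 0
    ][:top_n]
    return (
        f"x={cell['x']:.1f} y={cell['y']:.1f} "
        f"type={cell['cell_type']} genes: {' '.join(ranked)}"
    )


cell = {
    "x": 10.0,
    "y": 2.5,
    "cell_type": "astrocyte",
    "expression": {"Gfap": 12, "Aqp4": 7, "Snap25": 0, "Slc1a3": 7},
}
sentence = cell_to_sentence(cell)
```

A tissue patch would then be a sequence of such sentences, which is what lets a pretrained LLM consume spatial data without architectural changes.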

Personal highlights

Reformulating spatial biology as a language problem: TissueNarrator converts tissue patches into “spatial sentences” by serializing cells with explicit spatial coordinates and metadata, enabling pretrained LLMs to interpret and generate spatially conditioned cellular profiles without specialized architectures.
Generative simulation of cell–cell interactions and perturbations: the model can generate realistic cell states conditioned on neighborhood context, simulate cross-regional cell transplants, and predict transcriptional responses to genetic knockouts—all through in silico perturbation modeling.
Natural-language querying of tissue organization: beyond cell generation, TissueNarrator supports conversational Q&A about spatial regions, allowing users to ask questions like “What cell types are present in this brain region?” or “Describe the function of this anatomical structure.”
Integration of biological prior knowledge with spatial context: by fine-tuning LLMs pretrained on vast textual corpora, TissueNarrator leverages embedded biological knowledge to improve generalization and interpretability, linking spatial data with natural-language explanations.

Thanks for reading.

Cheers,

Seb.

Sebcentrism
