Weekly reads 27/4/26

Reference-free discovery, spatial resolution, and targeting tumor states

May 03, 2026

This week’s reads highlight how much biology still sits just beyond the reach of our standard tools and how new methods are starting to uncover it. Reference-free single-cell analysis identifies novel transcriptional variation, including new protein families in non-model organisms, whereas innovations in spatial transcriptomics address longstanding issues such as RNA mobility and leakage. At the same time, new computational models offer insights into spatial organization, ranging from the coordinated expression of genes across different cell types to the subcellular localization of RNAs. On the biology side, one of the most interesting papers focuses on the mechanics of the heart as an effective but underestimated anti-cancer mechanism, whereas another describes a rationally designed combination therapy for diffuse midline glioma that exploits co-existing tumour states to achieve higher efficacy than monotherapies.

Preprints/articles that I managed to read this week

Reference-free discovery with barcoded single-cell sequencing

Dehghannasiri et al. Nature Biotechnology (2026). 10.1038/s41587-026-03084-6

The paper in one sentence

sc‑SPLASH is a reference‑free, statistics‑first pipeline for droplet‑based single‑cell and spatial transcriptomics that discovers regulated sequence variation (including novel secreted repeat proteins missing from reference genomes) without alignment, while its BKC module preprocesses 10x data ~50× faster than UMI‑tools.

Summary

Most scRNA‑seq analyses rely on alignment to a reference genome, which biases discovery toward known genes and fails in non‑model organisms with incomplete references. The authors adapt the SPLASH framework (k‑mer‑based, statistical test for sample‑dependent sequence diversity) to barcoded 10x data. First, they develop BKC (barcoded‑read k‑mer counter), a C++ tool that extracts trusted cell barcodes, performs UMI deduplication, and counts anchor‑target k‑mer pairs – ~50× faster than UMI‑tools. Second, they build contingency tables per anchor across cells and compute a closed‑form P‑value for anchor‑target distribution heterogeneity. In human Tabula Sapiens data, sc‑SPLASH identifies cell‑type‑specific alternative splicing (e.g., RPS24, MYL6) and, after integration with IgBLAST, detects 60,697 productive V(D)J sequences across 16 tissues. On Visium spatial data, it finds a tumor‑associated double mutation in MT‑ND4 in squamous carcinoma and distinguishes keratin paralogs KRT16/KRT17. In electric eel, it detects RPS24 exon 6 inclusion in electrocytes vs. exclusion in stroma, evolutionarily conserved with humans. Crucially, in the freshwater sponge Spongilla lacustris (no complete reference), sc‑SPLASH identifies a highly diverse “granny” anchor (667 targets, entropy 6.2) absent from NCBI. Follow‑up PacBio sequencing reveals a family of five secreted repeat proteins (Granrep1‑5) expressed in granulocytes and amebocytes, immune‑responsive to LPS/cGAMP, and highly polymorphic (2‑6 alleles per gene). Similarly, in the tunicate Ciona robusta, sc‑SPLASH finds a YYD repeat anchor with dozens of targets, identifying two genes composed almost entirely of 24‑bp repeats, expressed in circulating hemocytes and peaking during metamorphosis. These discoveries showcase sc‑SPLASH’s power to reveal hidden transcriptomic complexity in any organism, without a reference.
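To make the anchor‑target idea concrete, here is a toy sketch of the bookkeeping, not the authors' BKC code. The function names, the tiny k, the exact‑match UMI deduplication and the use of Shannon entropy in bits are all my own assumptions for illustration; the closed‑form P‑value from the paper is not reproduced here.

```python
from collections import Counter, defaultdict
from math import log2

def anchor_target_pairs(read, k=4, gap=0):
    """Yield (anchor, target) k-mer pairs; the target starts `gap` bases after the anchor ends."""
    last_start = len(read) - (2 * k + gap)
    for i in range(last_start + 1):
        anchor = read[i:i + k]
        target = read[i + k + gap:i + 2 * k + gap]
        yield anchor, target

def build_contingency(records, k=4, gap=0):
    """records: iterable of (cell_barcode, umi, read).

    Returns {anchor: {cell: Counter(target -> count)}}. Each
    (cell, UMI, anchor, target) combination is counted once, a simplified
    stand-in for UMI deduplication (hypothetical, exact matches only).
    """
    seen = set()
    tables = defaultdict(lambda: defaultdict(Counter))
    for cell, umi, read in records:
        for anchor, target in anchor_target_pairs(read, k, gap):
            key = (cell, umi, anchor, target)
            if key in seen:
                continue
            seen.add(key)
            tables[anchor][cell][target] += 1
    return tables

def target_entropy(per_cell_counts):
    """Shannon entropy (bits) of the pooled target distribution of one anchor."""
    pooled = Counter()
    for counts in per_cell_counts.values():
        pooled.update(counts)
    total = sum(pooled.values())
    return -sum((n / total) * log2(n / total) for n in pooled.values())

# Made-up reads: two cells share the anchor ACGT but follow it with different targets.
records = [
    ("cellA", "u1", "ACGTAAAA"),
    ("cellA", "u1", "ACGTAAAA"),  # same UMI and sequence, counted once
    ("cellA", "u2", "ACGTAAAA"),
    ("cellB", "u3", "ACGTCCCC"),
    ("cellB", "u4", "ACGTCCCC"),
]
tables = build_contingency(records)
print(dict(tables["ACGT"]["cellA"]))   # {'AAAA': 2}
print(target_entropy(tables["ACGT"]))  # 1.0
```

An anchor whose targets split cleanly by cell, as in this toy case, is the kind of cell‑dependent sequence diversity the statistical test is designed to flag; an anchor with hundreds of targets, like “granny”, would show up as a very high entropy.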

Personal highlights

Ultra‑fast, reference‑free preprocessing: BKC performs cell barcode filtering, UMI deduplication, and k‑mer counting in C++ with parallelization, running ~50× faster than UMI‑tools (165 s vs. 9,272 s on a 10x dataset) and using less memory than Cell Ranger or STARsolo. This makes large‑scale reference‑free analysis practical.
Discovery of novel secreted repeat protein families in non‑model organisms: in sponge, sc‑SPLASH identifies the “granny” anchor with 667 distinct targets, leading to the characterisation of five Granrep genes – entirely absent from the reference genome, encoding secreted proteins with imperfect 30‑bp repeats, a signal peptide, and a lysine‑rich region. These are expressed in granulocytes (immune cells) and upregulated by LPS/cGAMP, suggesting an immune function.
Cell‑type‑specific alternative splicing and V(D)J detection without alignment: sc‑SPLASH detects RPS24 alternative splicing (inclusion/exclusion of microexons) across human tissues and in electric eel electrocytes vs. stroma, and integrates with IgBLAST to assemble 60,697 in‑frame V(D)J sequences from plasma and B cells – all without relying on a pre‑aligned reference for the discovery step.
Spatial transcriptomics applications that aligners miss: on Visium data from squamous cell carcinoma, sc‑SPLASH identifies a MT‑ND4 double mutation (CC→TT) enriched in the carcinoma region, and distinguishes KRT16 vs. KRT17 paralog expression patterns. In human fetal intestine, it detects RPS24 exon 5 inclusion in epithelium vs. exclusion in stroma – a 3‑nt microexon that standard pipelines often overlook.
Robust to batch effects and scalable: because the statistical test conditions on observing the anchor, sc‑SPLASH is naturally robust to technical variation. Across donors, the overlap of significant anchor clusters in the same tissue is significantly higher than expected by chance (binomial test, P < 2.2×10⁻¹⁶), confirming biological reproducibility.

Why should we care?

For researchers working on non‑model organisms, organisms with poor or incomplete genome assemblies, or any system where reference bias is a concern, sc‑SPLASH offers a genuine alternative to alignment‑dependent workflows. It does not require a reference to discover regulated sequence variation – it works directly from raw reads. The discovery of the Granrep and YYD repeat protein families, completely missed by standard pipelines and absent from reference genomes, is a powerful proof‑of‑principle that sc‑SPLASH can uncover biology that would otherwise remain invisible. That said, the method is a discovery engine, not a fully automated annotator: the novel genes required substantial follow‑up with long‑read sequencing and manual assembly to characterise. Also, while sc‑SPLASH is computationally efficient, post‑processing (e.g., extender alignment, Pfam search) still benefits from a reference. The tool is best seen as an unbiased hypothesis generator for sequence variation (splicing, mutations, paralog usage, repetitive elements, novel genes) that can be applied to any barcoded single‑cell or spatial dataset, including clinical samples and environmental species.

SpaceBender: Denoising spatial transcriptomics data to enhance biological signals

Chen et al. bioRxiv (2026). 10.64898/2026.04.20.719715

The paper in one sentence

SpaceBender adapts a deep generative model (originally for single‑cell ambient RNA removal) to spatial transcriptomics by incorporating spatially local ambient RNA profiles, outperforming existing denoising methods on simulations and chimeric tissues, and revealing hidden biological structures such as light‑zone vs. dark‑zone follicular regions in human lymph node.

Summary

Spatial transcriptomics (ST) data suffer from RNA diffusion – transcripts physically move from their cell of origin to neighbouring spots, blurring biological signals. Existing denoising methods (SpotClean, SpaDiff) either do not fully exploit spatial context or are based on different noise models. SpaceBender builds on the CellBender framework, adding two key spatial adaptations: (1) leveraging automated tissue detection to define empty spots (background) as negative controls, and (2) estimating ambient RNA profiles from local spatial neighbourhoods rather than globally. In simulated ST data (with transcript positions perturbed by 100–1000% of spot radius), SpaceBender achieved lower root‑mean‑squared error and Jensen‑Shannon divergence than SpotClean and SpaDiff. On mouse‑human chimeric Visium data (where ground‑truth species mixing is known), SpaceBender gave higher adjusted mutual information and adjusted Rand index, indicating that denoised clusters better separate human and mouse spots. In a human lymph node Visium dataset, SpaceBender split a single follicle cluster into two biologically meaningful subclusters – the light zone (LZ) and dark zone (DZ) – with enriched pathway scores (proliferation, DNA repair) that were far more significant after denoising (e.g., proliferation p‑value from 4.29×10⁻⁵ to 4.11×10⁻¹⁶). In a melanoma Visium dataset with a known B2M‑loss subclone, SpaceBender improved separation of the subclone from other tumour spots (higher silhouette score) and increased the number of differentially expressed genes from 2 to 75 (FDR<0.05). Finally, SpaceBender extended to subcellular resolution (MERFISH, CosMx, Xenium), reducing off‑target marker expression (e.g., CD79B in non‑B cells) and decreasing apparent doublet counts (CD3D⁺CD79B⁺ cells) significantly (Fisher’s exact test p‑value 2.2×10⁻¹⁶). The method is open‑source and parameter‑robust.
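The "local rather than global" ambient idea can be illustrated with a few lines of plain Python. This is only a sketch of the intuition: SpaceBender itself is a deep generative model, and the function name, the fixed radius and the toy counts below are my assumptions, not anything taken from the preprint.

```python
from math import hypot

def local_ambient_profile(spot_xy, empty_spots, radius):
    """Pool counts from empty (off-tissue) spots within `radius` of a spot
    and normalise them to a per-gene probability vector.

    empty_spots: list of ((x, y), counts), where counts is a list of per-gene counts.
    Returns None when no background signal is found nearby.
    """
    n_genes = len(empty_spots[0][1])
    summed = [0.0] * n_genes
    for (x, y), counts in empty_spots:
        if hypot(x - spot_xy[0], y - spot_xy[1]) <= radius:
            for g, c in enumerate(counts):
                summed[g] += c
    total = sum(summed)
    if total == 0:
        return None  # a real tool would need a fallback, e.g. a global profile
    return [s / total for s in summed]

# Toy background: gene 0 leaks on the left of the slide, gene 1 on the right.
empty = [((0, 0), [8, 2]), ((1, 0), [6, 4]), ((10, 0), [1, 9])]
print(local_ambient_profile((0.5, 0), empty, radius=2))  # [0.7, 0.3]
print(local_ambient_profile((10, 1), empty, radius=2))   # [0.1, 0.9]
```

A single global profile would average these two regions together, which is exactly the information a spatially varying background is meant to keep.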

Personal highlights

Spatially aware ambient RNA modeling improves denoising: unlike single‑cell methods that assume uniform background, SpaceBender computes local ambient RNA profiles per spatial neighbourhood, capturing diffusion gradients across tissue regions. This is implemented by defining empty “background” spots (using automated tissue detection) and modelling their gene expression as a spatially varying prior.
Consistently outperforms existing methods on benchmarks: on simulated data with escalating noise (100–1000% spot radius), SpaceBender achieved RMSE ≈1.88 vs. 2.47 (SpotClean) and 2.92 (SpaDiff). On mouse‑human chimeric tissues, SpaceBender gave the highest adjusted mutual information (0.11 vs. -0.04 for SpotClean, -1.27 for SpaDiff), demonstrating that denoised clusters better match true species identity.
Extends to subcellular resolution data (MERFISH, CosMx, Xenium): SpaceBender reduced off‑target expression of cell‑type markers (e.g., B‑cell marker CD79B in non‑B cells) and significantly decreased doublet‑like co‑expression of CD3D (T cells) and CD79B (B cells) in the MERFISH tonsil dataset (Fisher’s exact p‑value 2.2×10⁻¹⁶). Similar improvements were seen in CosMx NSCLC and Xenium melanoma data.
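For readers less familiar with the two benchmark metrics quoted above, here is a minimal, self‑contained version of both. How the preprint aggregates them (per spot, per gene, or over the whole matrix) is not something I can confirm, so treat the vectors below as generic examples.

```python
from math import log2, sqrt

def rmse(truth, estimate):
    """Root-mean-squared error between two equal-length vectors."""
    return sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))

def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2 (range 0 to 1) between two
    non-negative vectors, which are normalised to sum to 1 first."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(rmse([0, 0, 0, 0], [2, 2, 2, 2]))  # 2.0
print(js_divergence([1, 1], [2, 2]))     # 0.0, same profile at different depth
print(js_divergence([1, 0], [0, 1]))     # 1.0, completely different profiles
```

RMSE is sensitive to absolute count differences, whereas the Jensen‑Shannon divergence compares the shape of the expression profiles, which is why reporting both is informative for a denoiser.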

Mechanical load inhibits cancer growth in mouse and human hearts

Ciucci et al. Science (2026). 10.1126/science.ads9412

The paper in one sentence

Mechanical forces from heartbeats suppress cancer cell proliferation by activating Nesprin‑2‑mediated mechanotransduction, which reduces histone H3K9 trimethylation and decompacts chromatin at growth‑regulatory loci, explaining why the heart is remarkably resistant to both primary and metastatic cancers.

Summary

The heart is rarely affected by cancer, a puzzling fact given its high blood flow and constant perfusion, which should favour metastasis. The authors hypothesised that the same mechanical forces that stop cardiomyocyte proliferation after birth might also inhibit cancer cells. Using a heterotopic heart transplantation model in mice (where a donor heart is surgically connected to neck vessels, restoring blood flow but removing left‑ventricular load), they found that lung cancer cells injected into unloaded hearts grew dramatically larger tumours than those in normally loaded hearts – not due to better initial engraftment, but due to increased proliferation. Engineered heart tissues (EHTs) with adjustable mechanical load confirmed the effect: unloading promoted cancer cell growth, while overloading suppressed it. In human cardiac metastases from three different primary tumours (lung, colon, melanoma), spatial transcriptomics revealed a common transcriptional signature in cardiac lesions, with strong up‑regulation of histone demethylases and reduced H3K9me3 and chromatin compaction compared to matched extracardiac metastases. Mechanistically, mechanical load acts through Nesprin‑2, a linker of nucleoskeleton and cytoskeleton (LINC) complex protein. Silencing Nesprin‑2 in cancer cells restored their ability to proliferate in loaded hearts and EHTs, increased H3K9me3 and chromatin compaction, and abolished the growth‑suppressive effect of mechanical load. The study links physical forces to epigenetic regulation of cancer cell proliferation, identifying a previously unrecognised tumour‑suppressive mechanism unique to the heart.

Personal highlights

The heart actively suppresses cancer growth via mechanical load: in a genetically engineered mouse model (K‑RasG12D; p53‑/‑), tumours developed in liver, skeletal muscle and other organs – but never in the heart, despite comparable oncogene activation. Heterotopic transplantation showed that unloading the heart dramatically increased the growth of injected lung cancer cells, proving that mechanical load (not just blood flow or immune surveillance) is the key protective factor.
Cardiac metastases share a conserved transcriptional signature, independent of primary tumour type: spatial transcriptomics of human cardiac metastases (from lung, colon and melanoma) revealed that cancer cells in the heart up‑regulate histone demethylases (e.g., KDM4C, KDM4D) and have reduced H3K9me3 and less compact chromatin compared to matched extracardiac lesions. This signature was not seen in primary tumours or other metastases, suggesting that the heart mechanically reprograms cancer cells.
Nesprin‑2 is the essential mechanosensor: silencing of Nesprin‑2 (but not other LINC complex proteins) in lung, colon and melanoma cells completely reversed the growth‑suppressive effect of mechanical load. Nesprin‑2‑silenced cancer cells grew as large tumours in loaded hearts, with increased H3K9me3 and chromatin compaction, demonstrating that Nesprin‑2 transmits mechanical forces into epigenetic changes that inhibit proliferation.
Chromatin accessibility and H3K9me3 are dynamically regulated by load: ATAC‑seq and ChIP‑seq on cancer cells harvested from loaded vs. unloaded hearts showed that mechanical load increases chromatin accessibility at loci involved in cell‑cycle arrest, mechanosensing and calcium homeostasis, while reducing H3K9me3 at those same regulatory regions. The effects were recapitulated in EHTs and were dependent on Nesprin‑2.
Potential therapeutic implications: although the heart’s high mechanical load is unique, the study raises the possibility that artificial mechanical stimulation (e.g., via external devices) might be explored to suppress cancer growth in other tissues. More immediately, the work explains why cardiac metastases are rare and small, and it identifies the Nesprin‑2–H3K9me3 axis as a plausible target for preventing or treating cardiac metastases.

Why should we care?

This paper elegantly solves a long‑standing medical curiosity: why does the heart, a highly vascularised, constantly perfused organ, almost never get cancer? The answer is not that cancer cells cannot reach the heart, but that the relentless mechanical beating creates a hostile environment that stops them from proliferating. The discovery of Nesprin‑2 as the key force sensor that translates physical strain into an epigenetic brake on cell division is satisfying at a basic science level and opens new questions about how other tissues with distinct mechanical properties (e.g., skeletal muscle, bone, blood vessels) might also suppress or promote cancer. However, the translational relevance is limited: you cannot easily “mechanically load” a metastatic deposit in the liver or lung without causing damage. The study also does not test whether existing heart failure patients with reduced cardiac output (lower load) have a higher incidence of cardiac metastases.

Systematic design of combination therapy by targeting master regulators of coexisting diffuse midline glioma cell states

Calvo Fernández et al. Nature Genetics (2026). 10.1038/s41588-026-02550-w

The paper in one sentence

A network-based framework that infers master regulator proteins from single‑cell RNA‑seq data identifies seven coexisting cell states in diffuse midline glioma (DMG) and predicts clinically actionable drug combinations that target complementary states, with avapritinib plus ruxolitinib more than tripling median survival in mice (83 vs. 25 days for vehicle).

Summary

Diffuse midline glioma (DMG) is a universally fatal pediatric brain tumour driven by non‑actionable histone mutations and characterised by extensive intratumoural heterogeneity. The authors developed a generalisable, mutation‑agnostic strategy to design combination therapies targeting coexisting cell states. Using single‑cell RNA‑seq from 14 DMG patients and protein activity inference (metaVIPER), they resolved seven malignant cell states spanning oligodendrocyte precursor cell (OPC)-like, oligodendrocyte (OC)-like and astrocyte (AC)-like programmes, each with distinct master regulator (MR) proteins. Pooled CRISPR‑Cas9 screens validated that these MRs represent functional dependencies, with FOXM1 being a conserved essential gene. They then profiled 372 clinically relevant drugs (FDA‑approved or late‑stage) in two DMG cell lines using PLATE‑seq, generating transcriptional perturbation signatures that reveal DMG‑specific mechanisms of action. The OncoTarget (targeting individual MRs) and OncoTreat (inverting the activity of the top 50 MRs) algorithms nominated drugs predicted to selectively deplete each cell state. In a subcutaneous xenograft model that recapitulates all seven human cell states, 8 out of 9 predicted drugs selectively depleted their target states (e.g., avapritinib depleted OPC states; ruxolitinib depleted AC states). In an orthotopic syngeneic model, OPC‑targeting monotherapies (avapritinib, trametinib, dinaciclib) modestly improved survival, but AC‑targeting drugs had little effect alone. However, combinations targeting complementary OPC and AC states significantly outperformed monotherapies: avapritinib + ruxolitinib extended median survival to 83 days vs. 25 days (vehicle) and 53.5 days (avapritinib alone). The synergy was not cell‑autonomous (in vitro additive) but reflected co‑depletion of distinct cell states in vivo. The framework also predicted drug efficacy in three patient biopsies, with predicted drugs being twice as likely to be effective. The study establishes a tumour‑agnostic, mechanism‑based pipeline for rational combination therapy design in heterogeneous cancers.

Personal highlights

Seven conserved DMG cell states resolved by protein activity, not just gene expression: using metaVIPER to infer regulatory protein activity from single‑cell RNA‑seq, the authors identified seven recurrent malignant states (OPC, OPCC, OPCQ, OC, OPC/OC, AC, OPC/AC) across 14 patients, each driven by distinct master regulators. This goes beyond conventional transcriptomic clustering and captures functional regulatory programmes.
CRISPR screens validate master regulators as essential dependencies: pooled knockout screens targeting all transcription factors in three genetically distinct DMG cell lines showed that VIPER‑inferred tumourigenic MRs are significantly enriched in essential genes. FOXM1 emerged as the most conserved dependency across states, and several other MRs (e.g., DLX1, SOX10) are known H3K27me3 targets that become de‑repressed by H3K27M mutations.
Large‑scale drug perturbation profiling defines DMG‑specific mechanisms of action: PLATE‑seq transcriptomic profiling of 372 oncology compounds in two high‑fidelity DMG cell lines generated proteome‑wide activity signatures. Drugs with unrelated primary targets often converged on similar DMG‑specific MoA profiles, enabling the OncoTreat algorithm to predict which drugs would invert the activity of cell‑state‑specific tumourigenic MRs.
In vivo validation: 8/9 drugs selectively deplete predicted states, but monotherapies targeting minority states fail: in a DIPG17 subcutaneous xenograft that preserved all seven human cell states, five OPC‑targeting drugs (avapritinib, trametinib, dinaciclib, etc.) specifically depleted OPC(/OPCC) states, while three of four AC‑targeting drugs (ruxolitinib, venetoclax, larotrectinib) depleted AC(/OPC/AC) states. However, in an orthotopic pontine model, only OPC‑targeting monotherapies modestly improved survival; AC‑targeting drugs alone had no benefit – consistent with AC states being a minority population.
Combinations targeting complementary cell states dramatically extend survival where monotherapies fail: Avapritinib + ruxolitinib extended median survival to 83 days vs. 25 days (vehicle) and 53.5 days (avapritinib alone); trametinib + ruxolitinib (45.5 vs. 28 days) and dinaciclib + ruxolitinib (48 vs. 30 days) also significantly outperformed monotherapies. Bliss independence assays in OPC‑dominant cell lines showed additive, not synergistic, effects – proving that in vivo synergy arises from co‑depletion of distinct cell states, not from cell‑autonomous drug interaction.
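Since the additive‑versus‑synergistic distinction carries the main argument here, a small worked example of the Bliss independence calculation may help. The inhibition fractions below are invented for illustration and are not values from the paper; only the survival numbers at the end are the ones quoted above.

```python
def bliss_expected(effect_a, effect_b):
    """Expected combined fractional effect (0 to 1) if two drugs act independently."""
    return effect_a + effect_b - effect_a * effect_b

def bliss_excess(observed, effect_a, effect_b):
    """Observed minus expected effect: clearly positive suggests synergy,
    near zero suggests additivity, clearly negative suggests antagonism."""
    return observed - bliss_expected(effect_a, effect_b)

# Hypothetical in vitro inhibition fractions for drug A, drug B and the combination.
a, b, combo = 0.40, 0.30, 0.58
print(round(bliss_expected(a, b), 2))       # 0.58
print(round(bliss_excess(combo, a, b), 2))  # 0.0, i.e. additive in this made-up case

# Survival ratios from the median survival figures quoted in the text.
print(round(83 / 25, 2))    # 3.32, combination vs. vehicle
print(round(83 / 53.5, 2))  # 1.55, combination vs. avapritinib alone
```

A Bliss excess near zero in a cell line dominated by one state, together with a large survival gain in vivo, is the pattern the authors interpret as co‑depletion of different cell populations rather than a drug‑drug interaction within the same cell.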

Why should we care?

This work provides a blueprint for moving beyond empirical, cell‑line‑based drug synergy testing to a mechanism‑driven, cell‑state‑resolved combination strategy. The key insight is that in a heterogeneous tumour, effective combination therapy does not require two drugs that kill the same cell better together; it requires two drugs that kill different cell populations that coexist within the same tumour. For DMG, a disease with <10% two‑year survival and no effective medical therapy, the clinically actionable combinations identified (avapritinib + ruxolitinib, trametinib + ruxolitinib, avapritinib + larotrectinib) are all FDA‑approved or late‑stage compounds, positioning them for rapid clinical translation. The framework itself is tumour‑agnostic and mutation‑agnostic: it requires only single‑cell or bulk RNA‑seq from patient tumours and a pre‑computed library of drug perturbation profiles. This could be applied to any heterogeneous cancer where coexisting cell states drive therapeutic resistance. Limitations include the reliance on in vivo models that may not fully capture human tumour microenvironment complexity, and the fact that the survival benefits, while impressive, are still modest (83 days median – a ~3‑fold extension but not a cure).

CoPro: Dissecting the coordinated progression of cell states in spatial transcriptomics

Miao et al. bioRxiv (2026). 10.64898/2026.04.17.719309

The paper in one sentence

CoPro is a computational framework that uses spatial kernel‑restricted canonical correlation analysis to detect multiple, overlapping, continuous gene expression gradients that progress in a coordinated manner across different cell types in spatial transcriptomics data.

Summary

Spatial transcriptomics allows us to see where genes are expressed, but most analysis methods discretise tissues into distinct “neighbourhoods” or recover only a single dominant gradient. CoPro takes a different approach: it models tissue organisation as a superposition of continuous axes of coordinated variation across cell types. The core idea is a spatial kernel‑restricted CCA (skrCCA): for two or more cell types, CoPro finds linear combinations of genes (cell‑type‑specific “progression scores”) that maximise their correlation after weighting by spatial proximity (a Gaussian kernel). This captures how the molecular states of neighbouring cells change together along a shared spatial axis. CoPro can operate in unsupervised mode (discovering axes de novo) or supervised mode (using a known trajectory in one cell type to find coupled programs in others). It can also transfer learned gene weights to new samples, enabling cross‑sample comparison without spatial registration. Through simulations and four real datasets (colon injury, brain striatum, aging liver, kidney), the authors show that CoPro resolves orthogonal gradients (e.g., crypt morphology vs. inflammation in injured colon; dorsal‑ventral vs. medial‑lateral in brain striatum), recovers known zonation in liver and kidney from histology‑imputed data, and quantifies the breakdown of tissue organisation during aging.
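To get a feel for what "correlation after weighting by spatial proximity" means, here is a toy version of the quantity being maximised. It takes the progression scores as given instead of learning gene weights, so it is not skrCCA itself; the normalisation and all names are my assumptions rather than the authors' definitions.

```python
from math import exp, sqrt

def gaussian_kernel(pos_a, pos_b, sigma):
    """Pairwise Gaussian weights between cells of type A (rows) and type B (columns)."""
    return [[exp(-((xa - xb) ** 2 + (ya - yb) ** 2) / (2 * sigma ** 2))
             for (xb, yb) in pos_b]
            for (xa, ya) in pos_a]

def centre(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def kernel_weighted_correlation(score_a, score_b, kernel):
    """Correlation between the scores of cross-type cell pairs, with each pair
    weighted by its spatial proximity; bounded between -1 and 1."""
    a, b = centre(score_a), centre(score_b)
    cross = var_a = var_b = 0.0
    for i, row in enumerate(kernel):
        for j, w in enumerate(row):
            cross += w * a[i] * b[j]
            var_a += w * a[i] ** 2
            var_b += w * b[j] ** 2
    return cross / sqrt(var_a * var_b)

# Two cell types interleaved along the x axis, both with a score that rises with x.
pos_a = [(i, 0.0) for i in range(10)]
pos_b = [(i, 0.5) for i in range(10)]
score_a = [float(i) for i in range(10)]
score_b = [float(i) for i in range(10)]
kernel = gaussian_kernel(pos_a, pos_b, sigma=1.0)

aligned = kernel_weighted_correlation(score_a, score_b, kernel)
opposed = kernel_weighted_correlation(score_a, score_b[::-1], kernel)
print(aligned, opposed)  # strongly positive vs. equally strongly negative
```

In the real method the scores are linear combinations of genes chosen to maximise this kind of objective, and further axes are found under an orthogonality constraint; the kernel bandwidth plays the role of `sigma` here.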

Personal highlights

Spatial kernel‑restricted CCA captures cross‑type coordination at single‑cell resolution: unlike methods that bin cells into grids or discrete neighbourhoods, CoPro operates directly on pairwise spatial distances via a Gaussian kernel. This preserves rare cell types and avoids arbitrary discretisation, while the kernel bandwidth is automatically selected from the data.
Decomposes multiple overlapping spatial gradients: many tissues contain several biological processes superimposed in the same space (e.g., a differentiation gradient plus a patchy inflammatory response). CoPro iteratively finds orthogonal axes of coordinated progression, separating these processes into distinct, interpretable components – a capability lacking in most existing spatial methods.
Supervised mode for hypothesis‑driven discovery: Given a known or inferred spatial trend in one cell type (e.g., tubular epithelial ordering along the corticomedullary axis), CoPro identifies gene programs in other cell types (e.g., vascular endothelium) that co‑vary with it. This makes the framework useful for targeted biological questions.
Cross‑sample axis transfer without spatial registration: By fixing gene weights learned from a reference sample, CoPro projects new samples onto the same biological axis (e.g., a “disease progression” score). This enables direct comparison of cell states across samples or conditions without aligning tissue morphology.

Resolving sensitivity, specificity and signal contamination in Xenium spatial transcriptomics

Bilous et al. Nature Methods (2026). 10.1038/s41592-026-03089-8

The paper in one sentence

Analysis of 41 breast and lung tumour sections reveals that Xenium spatial transcriptomics data suffer from substantial transcript spillover between neighbouring cells, and the authors introduce SPLIT, a reference‑based computational method that decomposes mixed signals to improve cell‑type purity and reveal biologically relevant signatures such as T‑cell exhaustion.

Summary

This study provides one of the largest Xenium datasets to date (41 sections from 27 donors, both breast and lung cancer) and systematically evaluates key performance characteristics: sensitivity (transcript detection), specificity (spillover contamination), panel design (targeted vs. 5K Prime), and segmentation strategies. The authors show that targeted panels (e.g., Lung panel) have higher per‑gene sensitivity than the broader 5K panel, despite detecting fewer total genes. They demonstrate that transcript spillover – where transcripts from one cell are incorrectly assigned to a neighbour – is pervasive, particularly affecting low‑RNA cells like T cells, and correlates strongly with local abundance of the contaminating cell type (e.g., malignant cells). Using RCTD doublet mode, they quantify contamination as a secondary cell‑type weight. They then introduce SPLIT (Spatial Purification of Layered Intracellular Transcripts), which uses the RCTD weights and reference profiles to decompose each cell’s expression into primary and secondary components, effectively removing contaminating signal. SPLIT outperforms other correction methods (ResolVI, ovrlpy) in preserving gene detection, improving cell‑type separation, and recovering biological signals – notably, after SPLIT correction, T cells near malignant cells show clear exhaustion signatures (HAVCR2, CTLA4, PDCD1, LAG3, CXCL13) that were obscured by spillover. SPLIT is deconvolution‑agnostic (works with any reference‑based method) and can be combined with alternative segmentation algorithms (e.g., ProSeg) for further gains.
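The decomposition step can be pictured as a per‑gene scaling: keep the share of each count that the primary cell type is expected to contribute, given the two weights and the two reference profiles. The sketch below is my reading of that idea, with hypothetical names and made‑up reference values; it is not the SPLIT implementation.

```python
def purify_cell(observed, ref_primary, ref_secondary, w_primary, w_secondary):
    """Scale each gene's count by the fraction expected to come from the primary cell type.

    observed: {gene: count} for one segmented cell.
    ref_*: {gene: mean expression} reference profiles of the two cell types.
    w_*: deconvolution weights for the primary and secondary cell type.
    Genes absent from both references are left unchanged (a design choice of this sketch).
    """
    purified = {}
    for gene, count in observed.items():
        from_primary = w_primary * ref_primary.get(gene, 0.0)
        from_secondary = w_secondary * ref_secondary.get(gene, 0.0)
        total = from_primary + from_secondary
        share = from_primary / total if total > 0 else 1.0
        purified[gene] = count * share
    return purified

# A T cell sitting next to malignant cells (all numbers invented for illustration).
observed = {"CD3D": 10, "EPCAM": 6, "PDCD1": 2}
ref_t_cell = {"CD3D": 0.5, "EPCAM": 0.0, "PDCD1": 0.1}
ref_malignant = {"CD3D": 0.0, "EPCAM": 0.6, "PDCD1": 0.0}

print(purify_cell(observed, ref_t_cell, ref_malignant, w_primary=0.7, w_secondary=0.3))
# {'CD3D': 10.0, 'EPCAM': 0.0, 'PDCD1': 2.0}
```

The toy case also shows the dependence on the reference: if a gene is missing or wrong in the reference profile of either cell type, the share is wrong too, which is the caveat about incomplete references.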

Personal highlights

Transcript spillover is widespread and quantifiable: using RCTD’s doublet mode, the authors show that a cell’s secondary contamination weight correlates strongly with the local abundance of the contaminating cell type (e.g., malignant cells). This spillover disproportionately affects low‑RNA‑content cells (e.g., T cells) and can lead to misannotation (e.g., a CD8+ T cell called as malignant).
Targeted panels outperform the 5K panel in per‑gene sensitivity: while the 5K panel detects more total transcripts, targeted panels show higher sensitivity per gene, better cell‑type separation, and fewer QC failures. About 60% of 5K cells fail QC due to low transcript counts, a crucial trade‑off for users choosing panels.
SPLIT improves signal purity without over‑correction: unlike methods that reduce total gene counts or distort expression, SPLIT uses a simple, interpretable scaling factor based on reference profiles. It retains more cells and genes while significantly reducing contamination, as measured by cosine similarity to matched snRNA‑seq reference profiles and by removal of malignant marker genes from T cells.

Why should we care?

For researchers using imaging‑based spatial transcriptomics, this work provides essential guidance: transcript spillover is real, it affects downstream biological conclusions (e.g., cell‑cell communication, exhaustion), and it can be corrected. The comparison of targeted vs. 5K panels gives practical advice: if sensitivity for specific genes matters, targeted panels are better; if you need broad discovery, accept lower per‑gene sensitivity. SPLIT is a practical, open‑source tool that integrates with existing annotation pipelines (RCTD) and works with any segmentation. It does not require raw transcript coordinates or complicated spatial models, making it easy to adopt. However, SPLIT depends on a good reference dataset; missing cell types can lead to artefacts. Also, the validation is limited to two cancer types, and the IHC validation only worked on five samples.

SubCellSpace: Automated characterization of subcellular mRNA localization patterns in spatial transcriptomics

Wouters et al. bioRxiv (2026). 10.64898/2026.04.28.720613

The paper in one sentence

SubCellSpace is a convolutional variational autoencoder that learns a general, interpretable latent space of subcellular mRNA localization patterns from imaging‑based spatial transcriptomics data, enabling automated detection of non‑randomly localised transcripts, pattern classification, and unsupervised exploration of colocalisation and cellular heterogeneity.

Summary

Until recently, studying subcellular RNA localisation at scale was impossible. Imaging‑based spatial transcriptomics (MERFISH, Xenium) now provide single‑molecule resolution, but computational tools for automated pattern discovery are lacking. Many existing methods rely on hand‑crafted features (distance to nucleus), assume unrealistically high transcript counts, or cannot handle the heterogeneity of real data. SubCellSpace takes a different approach: it converts each cell‑gene observation into a 100×100 pixel image (gaussian‑blurred transcript positions plus a nuclear mask) and trains a convolutional variational autoencoder (CVAE) on a large simulated dataset of nine pattern types (random, intranuclear, extranuclear, perinuclear, cell‑edge, pericellular, nuclear‑edge, protrusion, foci). The encoder compresses each image into a 15‑dimensional latent space that separates patterns by type while being robust to cell shape and orientation. A classifier trained on this latent space assigns a pattern‑probability score to each observation. To determine whether a gene is significantly localised across a cell population, SubCellSpace compares the distribution of these scores (over all cells expressing that gene) to a null distribution generated by shuffling transcript positions within each cell, using a Kolmogorov‑Smirnov test and Earth Mover’s Distance as an effect size. The method was validated on a novel Xenium dataset of HEK293T cells targeting 220 genes with known subcellular compartment assignments from APEX‑seq (precision 0.99, recall 0.30 at stringent threshold). Applied to mouse small‑intestine MERFISH data, SubCellSpace correctly identified 13 of 19 known apical‑basal polarised genes (F1 0.79) and, remarkably, used the latent space to infer the orientation (left/right) of enterocytes from the localisation pattern of Apob alone.
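The per‑gene significance step compares two one‑dimensional distributions of scores, observed versus shuffled. A stdlib‑only sketch of the two quantities involved is shown below; the score values are invented, the equal‑sample‑size shortcut for the Earth Mover's Distance is my simplification, and no P‑value is computed (in practice `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` would do this job).

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    largest_gap = 0.0
    for t in xs + ys:
        cdf_x = bisect_right(xs, t) / len(xs)
        cdf_y = bisect_right(ys, t) / len(ys)
        largest_gap = max(largest_gap, abs(cdf_x - cdf_y))
    return largest_gap

def emd_1d(x, y):
    """Earth Mover's Distance between two equal-sized 1D samples:
    the mean absolute difference between their sorted values."""
    if len(x) != len(y):
        raise ValueError("this shortcut assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)

# Hypothetical pattern-probability scores for one gene across four cells,
# and the scores obtained after shuffling transcript positions within each cell.
observed_scores = [0.70, 0.80, 0.90, 0.85]
shuffled_scores = [0.10, 0.20, 0.15, 0.30]

print(ks_statistic(observed_scores, shuffled_scores))      # 1.0, no overlap at all
print(round(emd_1d(observed_scores, shuffled_scores), 3))  # 0.625
```

The KS test answers whether the two distributions differ at all, while the EMD says by how much, which is why the method thresholds on the latter (0.03 lenient, 0.06 stringent) as an effect size.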

Personal highlights

Learns a general, interpretable latent space from simulated patterns: The CVAE is trained on 9 pattern classes simulated across 317 cell shapes, with Gaussian blur to handle sparse transcripts (10–100 spots per cell). The resulting 15‑dimensional embedding separates pattern types (silhouette 0.263), is robust to cell identity and rotation, and generalises to unseen patterns (e.g., protrusion maps to a distinct cluster) and real data without retraining.
Automated pipeline for pattern detection and quantification: SubCellSpace includes an end‑to‑end processing pipeline that (re)segments cells, generates per‑cell‑gene images, and computes embeddings. A random forest classifier then produces a pattern‑probability per observation. The per‑gene test distribution is compared to a spatially shuffled null using a Kolmogorov‑Smirnov test, with Earth Mover’s Distance as an effect size (thresholds 0.03 lenient, 0.06 stringent). This controls false discovery rate (precision 0.99 at stringent threshold).
Validated on novel APEX‑seq‑guided Xenium dataset: The authors generated a bespoke Xenium dataset of HEK293T cells targeting 220 genes (including 170 with known compartment assignments from APEX‑seq and 50 controls). SubCellSpace achieved F1 0.46 (stringent) with precision 0.99, correctly separating nucleus‑associated from cytosolic/ER‑membrane patterns. This is the first publicly available benchmarking resource for subcellular localisation in imaging‑based ST.
Unsupervised exploration reveals colocalisation and cellular orientation: Beyond supervised classification, the latent space enables unsupervised tasks. Genes with similar patterns (e.g., colocalising transcripts) cluster together, and the embedding captures subtle variations such as the left‑right orientation of polarised enterocytes. Using only the apical marker Apob, SubCellSpace could infer the apical/basal direction of other genes, recovering the known polarity of 83% of polarised genes.
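As a quick sanity check on the benchmark numbers, the reported F1 at the stringent threshold follows directly from the precision and recall quoted in the summary (standard F1 definition; the helper name is mine):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 0.99 and recall 0.30 at the stringent threshold, as quoted above.
print(round(f1_score(0.99, 0.30), 2))  # 0.46
```

The low F1 is therefore driven almost entirely by recall: at the stringent threshold the method is rarely wrong when it calls a gene localised, but it misses most of the known cases.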

Why should we care?

Subcellular mRNA localisation is a critical but understudied layer of gene regulation, yet systematic discovery has been limited by the lack of scalable, automated methods. The SubCellSpace approach provides an insightful and interpretable methodology that translates the output of MERFISH/Xenium-based spatial transcriptomics into an analytical, statistical localization classification. The ability to learn on simulated data and generalize on experimental data without further training is its strong point. The newly created APEX-seq-driven Xenium dataset will serve as a useful benchmark for further methods’ development. It should be noted that there are certain drawbacks to the SubCellSpace methodology; for example, it is not effective at classifying “foci” localization patterns because of the Gaussian blur. Furthermore, the latent space is not disentangled (orientation and localization pattern type share the same dimensions), and eight to ten cells per gene are required for the reliable detection of localization patterns. However, despite its disadvantages, the SubCellSpace methodology is an excellent starting point for genome-wide investigation of mRNA localization processes.

Other papers that piqued my interest and were added to the purgatory of my “to read” pile

Thanks for reading.

Cheers,

Seb.

Sebcentrism
