Weekly reads 12/1/25

Context, benchmarks and reality checks for single-cell foundation models

Jan 19, 2026

This week’s papers captured a moment of reckoning, and reinvention, for single-cell and spatial foundation models. On one hand, large pretrained models promise transfer learning, multimodal reasoning, and zero-shot inference; on the other, rigorous benchmarks reveal that scale alone does not guarantee superiority over classical approaches. Hou et al. provide a much-needed reality check, showing through the most comprehensive scFM benchmark to date that pretrained embeddings shine mainly in low-label regimes, while PCA remains a formidable baseline in zero-shot, spatial, and fine-tuning settings. Pushing the modeling frontier, OKR-CELL and STACK explore complementary strategies for injecting context, via open-world biological knowledge, robust cross-modal alignment, and in-context learning over cell sets, to enable generalization to unseen cell types and perturbations. Meanwhile, progress on the experimental and infrastructural fronts ensures these models have something solid to stand on: Williams-Katek et al. bridge sensitivity and discovery in spatial transcriptomics with a dual-chemistry Xenium workflow, and SCALPEL delivers an atlas-scale pipeline that turns raw spatial data into anatomically registered, analysis-ready cell maps. Together, these studies clarify where foundation models already add value, where classical methods still dominate, and what kinds of context (biological, spatial, and experimental) will be required to move from impressive representations to reliable biological insight.

Preprints/articles that I managed to read this week

A Unified Framework Enables Accessible Deployment and Comprehensive Benchmarking of Single‑Cell Foundation Models

Hou et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.06.698060

The paper in one sentence

This work presents a standardized, containerized framework for running and fairly comparing 13 single‑cell foundation models (scFMs), revealing through systematic benchmarking that pretrained embeddings offer clear advantages in low‑label settings, but classical PCA often remains competitive or even superior in zero‑shot, spatial, and fine‑tuning scenarios.

Summary

Single‑cell foundation models (scFMs) promise to transform how we analyze transcriptomic data, but their practical adoption has been slowed by inconsistent performance, fragmented software, and a lack of rigorous, reproducible benchmarks. To address this, the authors built a unified Nextflow‑based framework that containerizes each scFM, provides a common embedding interface, and automates large‑scale evaluation across >50 datasets. They benchmark 13 scFMs against classical baselines under zero‑shot, few‑shot, and fine‑tuning regimes, covering tasks like cell‑type clustering, trajectory inference, spatial domain detection, and perturbation prediction. Key findings include: (1) in zero‑shot settings, PCA on highly variable genes remains highly competitive, often outperforming transformer‑based models, especially on spatial data; (2) with very few labels (e.g., 1‑shot), pretrained embeddings provide meaningful gains by denoising expression signals; (3) under full fine‑tuning, performance converges across methods, and misclassifications reflect biological ambiguity rather than model‑specific failures; (4) no scFM consistently beats simple additive baselines in perturbation prediction. The study concludes that scFMs are most valuable in low‑label or transfer contexts, but scale alone does not guarantee superiority over classical approaches.
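To make finding (4) concrete, here is a minimal sketch of what a "simple additive baseline" for combinatorial perturbation prediction can look like. The summary above does not spell out the exact baseline used in the benchmark, so the formulation below (control profile plus the sum of each single-perturbation shift) is my assumption of one common form, and the function name and toy numbers are made up for illustration.

```python
import numpy as np

def additive_prediction(control, pert_a, pert_b):
    """Predict the A+B profile by adding each single-perturbation shift to control.

    Assumed formulation (not taken from the paper): control + (A - control) + (B - control).
    """
    control = np.asarray(control, dtype=float)
    delta_a = np.asarray(pert_a, dtype=float) - control
    delta_b = np.asarray(pert_b, dtype=float) - control
    return control + delta_a + delta_b

# Toy mean log-expression profiles over four genes (made-up numbers).
control = np.array([1.0, 2.0, 0.5, 3.0])
pert_a = np.array([1.5, 2.0, 0.5, 2.0])  # A raises gene 0 and lowers gene 3
pert_b = np.array([1.0, 1.0, 0.5, 3.5])  # B lowers gene 1 and raises gene 3

predicted_ab = additive_prediction(control, pert_a, pert_b)
print(predicted_ab)
```

A baseline like this has no trainable parameters, which is what makes it a useful floor: a model that cannot beat it has not learned any interaction between the two perturbations.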

Personal highlights

Unified, containerized execution framework for scFMs: built on Nextflow, the pipeline encapsulates each model in a dedicated Docker/Singularity environment, provides a standardized AnnData interface, and enables one‑command reproducible benchmarking, drastically lowering technical barriers.
Most comprehensive scFM benchmark to date: evaluates 13 foundation models (including scGPT, Geneformer, scFoundation, SCimilarity, and UCE) alongside PCA across >50 datasets, spanning zero‑shot clustering, trajectory inference, spatial transcriptomics, few‑shot annotation, and perturbation tasks.
PCA remains a strong baseline, often superior in zero‑shot settings: across cell‑type clustering, spatial domain detection, and trajectory inference, the classical HVG+PCA pipeline frequently matched or exceeded pretrained embeddings, highlighting the enduring power of linear structure in transcriptomic data.
Pretrained embeddings shine in extremely low‑label regimes: with only one labeled cell per class, many scFMs outperformed raw gene expression, demonstrating an ability to extract denoised, biologically meaningful structure—but this advantage diminished rapidly with just five labeled cells.
No zero‑shot transfer to spatial transcriptomics without adaptation: models pretrained on scRNA‑seq failed to generalize to Visium, ST, or MERFISH data; PCA consistently outperformed all scFMs on spatial clustering, underscoring the need for modality‑aware pretraining or adaptation strategies.
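For readers who have not built the HVG+PCA baseline by hand, the sketch below shows roughly what it involves. It is not the benchmark's code: the normalization target (10,000 counts per cell), the plain variance ranking used as a stand-in for a proper highly-variable-gene procedure, and all function and parameter names are assumptions for illustration.

```python
import numpy as np

def hvg_pca_embedding(counts, n_hvg=2000, n_components=50):
    """Embed cells with a bare-bones HVG + PCA pipeline (cells x genes input)."""
    counts = np.asarray(counts, dtype=float)

    # Library-size normalize each cell to 10,000 counts, then log1p-transform.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    logged = np.log1p(counts / totals * 1e4)

    # Assumption: rank genes by variance as a simple proxy for HVG selection.
    n_hvg = min(n_hvg, logged.shape[1])
    top_genes = np.argsort(logged.var(axis=0))[::-1][:n_hvg]
    x = logged[:, top_genes]

    # PCA via SVD of the gene-centered matrix; scores are U * S.
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = min(n_components, s.size)
    return u[:, :k] * s[:k]

# Toy data: 200 cells x 500 genes of Poisson counts (no real structure).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(200, 500))
embedding = hvg_pca_embedding(toy_counts, n_hvg=100, n_components=10)
print(embedding.shape)
```

The resulting matrix can be fed to the same clustering or nearest-neighbor evaluation as any scFM embedding, which is what makes it a fair zero-shot comparator.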

Why should we care?

This work provides the community with both a practical tool and a clear‑eyed reality check on single‑cell foundation models. The unified framework makes scFMs accessible and reproducible for non‑specialists, while the benchmark establishes transparent, evidence‑based guidelines for when, and when not, to use them.

OKR-CELL: A Robust Single-Cell Foundation Model with Open-World Knowledge and Cross-Modal Learning

Wang et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.09.699573

The paper in one sentence

OKR-CELL is a cross-modal single-cell foundation model that integrates open-world biological knowledge from large language models with a robust alignment objective to enhance cell representation learning, improve noise tolerance, and enable zero-shot and few-shot cell annotation.

Summary

OKR-CELL addresses two key limitations in current single-cell foundation models: shallow integration of cellular context and sensitivity to noise in multimodal data. The model enriches textual descriptions of cells using retrieval-augmented generation (RAG) with large language models, and introduces a novel Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and momentum contrastive learning. Pre-trained on 32 million cell-text pairs, OKR-CELL achieves state-of-the-art performance across six tasks, including cell clustering, batch-effect correction, few-shot and zero-shot annotation, and bidirectional cross-modal retrieval. It demonstrates exceptional robustness to noisy data and generalizes well to unseen cell types, setting a new benchmark for multimodal single-cell analysis.

Personal highlights

Open-world knowledge augmentation via LLM + RAG: OKR-CELL uses retrieval-augmented generation with large language models to enrich sparse cell metadata into detailed, context-rich textual descriptions, incorporating broad biological knowledge while minimizing hallucinations through reliability screening.
Noise-robust cross-modal alignment with CRA: the model introduces a Cross-modal Robust Alignment objective that explicitly handles noisy training data by weighting sample reliability, employing curriculum learning, and expanding negative sample diversity through momentum-updated memory banks.
State-of-the-art performance across six benchmark tasks: OKR-CELL outperforms existing single-cell and cross-modal models in cell clustering, type annotation, batch-effect correction, and in zero-shot and few-shot settings, demonstrating strong generalization.
Superior robustness to noisy and corrupted data: the model maintains high accuracy even under significant gene dropout and shuffled cell-text pairings, showing real-world applicability where data quality is often imperfect.
Bidirectional cross-modal retrieval for unseen cell types: OKR-CELL enables accurate cell-to-text and text-to-cell retrieval on novel cell types, highlighting its ability to learn semantically meaningful joint representations without prior exposure.
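The idea of weighting sample reliability inside a contrastive objective is easier to see in code. The sketch below is a generic reliability-weighted symmetric InfoNCE loss, not the authors' CRA objective: it leaves out curriculum learning and the momentum-updated memory bank, and the function name, temperature, and weights are all assumptions for illustration.

```python
import numpy as np

def weighted_info_nce(cell_emb, text_emb, reliability, temperature=0.07):
    """Symmetric cell<->text contrastive loss with per-pair reliability weights."""
    cell = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = cell @ text.T / temperature  # (n_pairs, n_pairs) similarities

    def row_log_softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # Matched pairs sit on the diagonal; score both retrieval directions.
    loss_cell_to_text = -np.diag(row_log_softmax(logits))
    loss_text_to_cell = -np.diag(row_log_softmax(logits.T))
    per_pair = 0.5 * (loss_cell_to_text + loss_text_to_cell)

    # Down-weight pairs judged unreliable (weights are supplied by the caller).
    w = np.asarray(reliability, dtype=float)
    return float((w * per_pair).sum() / w.sum())

# Toy batch: pairs 0 and 1 are correctly matched, pairs 2 and 3 are swapped.
cell = np.eye(4)
text = np.eye(4)[[0, 1, 3, 2]]
uniform = weighted_info_nce(cell, text, [1.0, 1.0, 1.0, 1.0])
down_weighted = weighted_info_nce(cell, text, [1.0, 1.0, 0.1, 0.1])
print(uniform, down_weighted)
```

In the toy batch the two swapped pairs dominate the unweighted loss, and lowering their weights shrinks their pull on the gradient, which is the intuition behind tolerating shuffled cell-text pairings.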

Why should we care?

OKR-CELL bridges the gap between single-cell transcriptomics and natural language understanding, enabling deeper, more interpretable cell characterization. By integrating open-world knowledge and robust multimodal alignment, it moves beyond static annotations toward dynamic, context-aware cell modeling.

Fishing with Two Lines: A Hybrid Approach to Spatial Transcriptomic Discovery

Williams-Katek et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.07.698201

The paper in one sentence

This study presents a hybrid “dual chemistry” method that combines the high sensitivity of 10x Genomics Xenium V1 panels (480 genes) with the broad coverage of Prime 5K panels (5001 genes) on the same tissue section, enabling both targeted and discovery-driven spatial transcriptomics in the same cells.

Summary

The authors developed a novel experimental workflow to simultaneously profile single tissue sections using two complementary Xenium chemistries: a sensitive V1 custom panel (480 genes) and a broad Prime 5K panel. By co-hybridizing probes and sequentially running decoding chemistries on the same slide, they generated a unified dataset that retains high sensitivity for key markers while capturing thousands of additional genes for discovery. Applied to a human lung tissue microarray, the method showed high concordance with solo chemistry runs, retained more cells after quality filtering, and enabled integrated analyses such as secretome profiling. The approach provides a practical solution to the breadth-vs-depth trade-off in spatial transcriptomics, allowing researchers to combine hypothesis-driven and exploratory analyses within the same cellular context.

Personal highlights

Hybrid co-detection of V1 and Prime panels: the study successfully co-hybridizes and sequentially decodes two Xenium chemistries on the same tissue section, merging high sensitivity (V1) with broad gene coverage (Prime 5K) without major signal dropout.
Improved cell retention with combined data: using transcripts from both panels during quality filtering retained significantly more cells (75.5%) compared to using V1-only (59.6%) or Prime-only (47.2%) data, enhancing downstream analysis power.
Minimal interference between chemistries: despite 239 genes being shared between the two panels, V1 and Prime showed consistent expression patterns, with only modest competitive binding effects observed in the Prime dual run, validating the technical feasibility of the hybrid approach.
Practical protocol with cross-study integration value: the method provides a stable reference (Prime 5K) for multi-study alignment while preserving the sensitivity of custom V1 panels, addressing a key challenge in spatial omics meta-analysis.
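The cell-retention gain is intuitive once you think of quality filtering as a per-cell transcript threshold. The sketch below assumes a plain minimum-count filter, which may not match the paper's actual QC criteria; the threshold and the per-cell counts are made-up numbers, and the function name is hypothetical.

```python
import numpy as np

def retained_fraction(counts_per_cell, min_transcripts):
    """Fraction of cells whose transcript count passes a minimum-count filter."""
    counts = np.asarray(counts_per_cell)
    return float((counts >= min_transcripts).mean())

# Toy per-cell transcript totals for the same six cells under each chemistry.
v1_counts = np.array([30, 12, 45, 8, 22, 5])
prime_counts = np.array([10, 15, 20, 9, 4, 6])
combined_counts = v1_counts + prime_counts

min_transcripts = 20  # hypothetical QC threshold

v1_kept = retained_fraction(v1_counts, min_transcripts)
prime_kept = retained_fraction(prime_counts, min_transcripts)
combined_kept = retained_fraction(combined_counts, min_transcripts)
print(v1_kept, prime_kept, combined_kept)
```

Cells that fall just short under either panel alone can clear the bar once both sets of transcripts are pooled, which is the same direction of effect as the 75.5% versus 59.6% and 47.2% figures reported above.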

Why should we care?

This hybrid approach bridges a persistent gap in spatial transcriptomics: the trade-off between sensitivity and breadth. By allowing researchers to “fish with two lines”, simultaneously targeting known markers and capturing thousands of additional genes, it maximizes the informational yield from precious tissue samples. For disease researchers, this means more confident cell typing alongside unbiased discovery of novel signatures.

STACK: In-context learning of single-cell biology

Dong et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.09.696608

The paper in one sentence

STACK is a single-cell foundation model that uses cellular context and in-context learning to generate accurate cell embeddings and predict perturbation effects across tissues and donors without task-specific retraining.

Summary

STACK is a transformer-based model pre-trained on 189 million human single cells that introduces in-context learning to single-cell biology. Unlike previous models, it processes sets of cells together, enabling it to learn from cellular context during inference. This allows STACK to perform zero-shot tasks, like predicting drug responses in unseen cell types or generating donor-specific expression profiles, without fine-tuning. The model also supports “cell prompting,” where users can condition predictions on example cells (e.g., perturbed immune cells) to simulate how other cell types would respond. In evaluations, STACK outperformed existing baselines in embedding quality and perturbation prediction and was used to create Perturb Sapiens, a whole-organism atlas of simulated drug and cytokine responses across 28 tissues.

Personal highlights

Cell-set modeling with dual attention: STACK processes groups of cells using both intra-cellular attention (within each cell’s gene tokens) and inter-cellular attention (across cells in a set), enabling it to capture population-level signals often missed by single-cell-only models.
Gene module tokenization: instead of treating each gene separately, STACK learns to group genes into 100 “gene module tokens” per cell, reducing dimensionality while preserving biological coherence: 75% of top genes belong to only one module.
In-context learning via cell prompting: after pre-training, STACK can be “prompted” with example cells (e.g., cytokine-treated T cells) to predict how other cell types or donors would respond, enabling zero-shot generalization to new biological contexts.
Self-distillation for conditional generation: a post-training self-distillation procedure aligns the model to act as a conditional generator, predicting expression profiles for query cells based on prompt cells without needing perturbation labels or cell-type encodings.
Whole-organism perturbation atlas: using STACK, the authors generated Perturb Sapiens, a simulated atlas of 201 drug/cytokine responses across 28 human tissues, providing a resource for exploring perturbation effects in cell types never experimentally perturbed.
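To illustrate the dual-attention idea from the first highlight, here is a stripped-down sketch with no learned projections. It is not STACK's architecture: I am assuming that inter-cellular attention runs across cells at each token position, and the shapes and function names are chosen purely for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis (no weights)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def dual_attention(cell_set):
    """cell_set has shape (n_cells, n_tokens, dim)."""
    # Intra-cellular: tokens attend to the other tokens of the same cell.
    within = self_attention(cell_set)
    # Inter-cellular (assumed wiring): at each token position, cells attend to cells.
    across = self_attention(np.swapaxes(within, 0, 1))  # (n_tokens, n_cells, dim)
    return np.swapaxes(across, 0, 1)

# Toy cell set: 5 cells, 8 module tokens each, 16-dimensional embeddings.
rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 8, 16))
out = dual_attention(cells)

# Perturbing one cell changes the representation of another cell in the set.
shifted = cells.copy()
shifted[0] += 1.0
out_shifted = dual_attention(shifted)
print(out.shape, np.abs(out[1] - out_shifted[1]).max())
```

The last lines show the property that matters for cell prompting: a cell's output depends on which other cells share its set, so swapping in perturbed prompt cells changes what the model produces for the query cells.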

SCALPEL: A pipeline for processing large-scale spatial transcriptomics data

Kunst et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.09.698732

The paper in one sentence

SCALPEL is a modular, atlas-scale processing pipeline for spatial transcriptomics that improves cell segmentation, quality filtering, spatial domain detection, and anatomical registration, applied here to a whole mouse brain dataset of 5.5 million cells.

Summary

SCALPEL (Spatial Cell Analysis, Labeling, Processing, and Expression Linking) is a comprehensive computational pipeline designed for large-scale, cellular-resolution spatial transcriptomics data. Developed and benchmarked on a 59-section whole mouse brain MERSCOPE dataset (~500 genes per section), the pipeline introduces key upgrades over previous methods: it uses a custom 3D Cellpose model trained on mRNA density images for more accurate cell segmentation, implements adaptive doublet detection and label-transfer filtering, detects spatially coherent domains using graph neural networks (STAligner), registers cells to the Allen Mouse Brain Common Coordinate Framework (CCFv3) using cell-type landmarks, and optionally imputes genome-wide expression via integration with scRNA-seq. Applied to an existing atlas, SCALPEL increased cell registration accuracy, retained more high-quality cells, and produced cleaner expression profiles—enabling more robust downstream spatial analysis and cross-modal integration.
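The "mRNA density image" that feeds the segmentation model is, at its core, a histogram of transcript coordinates. The sketch below bins transcripts from a single z-plane into a 2D image, whereas the pipeline described above works in 3D; the bin size, field-of-view extent, and function name are my assumptions, not values from the paper.

```python
import numpy as np

def mrna_density_image(x_um, y_um, extent_um, bin_um=1.0):
    """Bin transcript coordinates (in microns) into a 2D count image."""
    n_bins = int(np.ceil(extent_um / bin_um))
    edges = np.arange(n_bins + 1) * bin_um
    # Rows index y and columns index x, matching the usual image convention.
    image, _, _ = np.histogram2d(y_um, x_um, bins=(edges, edges))
    return image

# Toy transcripts: two fall in the pixel at row 0, col 0; one at row 1, col 2.
x_um = np.array([0.5, 0.6, 2.5])
y_um = np.array([0.5, 0.4, 1.5])
density = mrna_density_image(x_um, y_um, extent_um=3.0, bin_um=1.0)
print(density)
```

An image like this can then be passed to a segmentation model in place of a stain channel, which is how I read the motivation for replacing PolyT staining.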

Personal highlights

3D segmentation using mRNA density images: SCALPEL replaces traditional PolyT-based cytoplasmic staining with a synthetic image derived from binned mRNA transcripts, enabling more accurate 3D cell boundary detection in dense tissues (e.g., cerebellum) and reducing contamination from extracellular RNA.
Adaptive, cluster-aware filtering for label transfer: instead of a fixed correlation threshold for cell-type mapping, SCALPEL uses a DoubleMAD approach to set per-cluster thresholds, preserving rare cell populations (e.g., RN Spp neurons) while removing low-confidence mappings, especially beneficial for non-neuronal cells with fewer panel markers.
Spatial domain detection with cross-section alignment: using STAligner, a graph attention network, the pipeline identifies transcriptionally coherent spatial neighborhoods that extend across serial sections, generating 67 domains that correspond to known anatomical regions, from broad cortical areas to specific nuclei.
Landmark-driven registration to a brain atlas: rather than relying on low-information fluorescence channels, SCALPEL registers sections to the CCFv3 by using cell-type clusters as anatomical landmarks, achieving high alignment accuracy (Adjusted Rand Index ~0.97) and enabling integration with connectomic, morphological, and patch-seq datasets.
Optional genome-wide gene imputation: through the ENVI framework, the pipeline imputes expression for genes not included in the original 500-gene panel, recovering spatially plausible patterns for thousands of additional genes (validated by independent MERFISH data), useful for exploring ligands, receptors, and functional pathways.
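The per-cluster thresholding in the second highlight can be sketched with the standard double-MAD recipe: take the median, then measure spread separately on each side of it. The code below uses only the lower side, since the goal is to drop low-confidence mappings. The cutoff of 3 MADs, the omission of the usual 1.4826 scaling constant, and the function names are my assumptions, not SCALPEL's settings.

```python
import numpy as np

def double_mad_lower_threshold(values, n_mads=3.0):
    """Lower cutoff: median minus n_mads times the MAD of values at or below the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    left = values[values <= med]
    left_mad = np.median(np.abs(left - med))
    return med - n_mads * left_mad

def filter_by_cluster(correlations, clusters, n_mads=3.0):
    """Keep cells whose mapping correlation clears their own cluster's threshold."""
    correlations = np.asarray(correlations, dtype=float)
    clusters = np.asarray(clusters)
    keep = np.zeros(correlations.shape, dtype=bool)
    for cluster in np.unique(clusters):
        mask = clusters == cluster
        threshold = double_mad_lower_threshold(correlations[mask], n_mads)
        keep[mask] = correlations[mask] >= threshold
    return keep

# Toy data: cluster "A" maps well overall, cluster "B" is a harder, rarer type.
correlations = [0.90, 0.88, 0.92, 0.91, 0.30, 0.50, 0.52, 0.48, 0.51, 0.10]
clusters = ["A"] * 5 + ["B"] * 5
keep = filter_by_cluster(correlations, clusters)
print(keep)
```

With a fixed cutoff of, say, 0.8, every cell in cluster "B" would be discarded; the per-cluster threshold removes only the outlier in each group, which is the behavior that preserves rare populations.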

Why should we care?

SCALPEL tackles a pressing need in spatial omics: standardized, scalable processing for atlas-level datasets. As efforts like BICAN, HuBMAP, and large-scale brain projects generate terabytes of spatial data, consistent and reproducible preprocessing is essential for cross-study comparison and meta-analysis. By improving segmentation, filtering, and registration, SCALPEL turns raw image data into cleaner, anatomically anchored cell-by-gene matrices, ready for downstream analysis like cell-cell interaction mapping, niche characterization, or differential spatial expression.

Other papers that piqued my interest and were added to the purgatory of my “to read” pile

Thanks for reading.

Cheers,

Seb.

Sebcentrism
