Weekly reads 5/1/26

Standardizing, disentangling, and interpreting cells in space

Jan 11, 2026

This week’s reads converge on a shared ambition in single-cell and spatial biology: turning increasingly complex data into stable, comparable, and mechanistically interpretable representations. REMO proposes a universal feature space for scATAC-seq, enabling atlas-scale chromatin accessibility analysis without dataset-specific peak calling, while SpaceTrooper reframes quality control in imaging-based spatial omics as a data-driven, unified scoring problem. Others focus on extracting signal from context: Haruka disentangles perturbation-specific tissue remodeling from conserved spatial architecture, SpacePhenotyper transfers clinically relevant phenotypes from bulk cohorts onto spatial maps, and CellWHISPER introduces a confounder-aware statistical framework for inferring direct cell–cell communication. At the interface of perturbation and representation learning, scPT-seq closes the genotype–phenotype loop by directly sequencing CRISPR edits alongside transcriptomes in vivo, while CONCORD shows that careful sampling, not model complexity, is sufficient to learn coherent, topology-preserving cell-state landscapes at scale.

Preprints/articles that I managed to read this week

Regulatory element modules as universal features for single-cell chromatin analysis

Lim et al. (2025). bioRxiv. DOI: 10.64898/2025.12.10.692786

The paper in one sentence

REMO (regulatory element modules) provides a universal, feature-reduced set of DNA accessibility regions for single-cell ATAC-seq analysis, enabling scalable, reproducible, and biologically interpretable chromatin profiling across datasets.

Summary

Single-cell ATAC-seq analysis is hindered by the lack of standardized genomic features, requiring dataset-specific peak calling that prevents cross-study comparison and scalability. The authors introduce REMO, a comprehensive set of 340,069 regulatory element modules for the human genome, derived from integrating ENCODE CREs, chromatin co-accessibility, Hi-C contact data, and biochemical activity profiles. REMO reduces feature sparsity, improves cell-state separation in low-dimensional embeddings, and enables efficient quantification via their new tool fragtk. Across 16 healthy and diseased tissues, REMO matches or outperforms peak-based analysis in cluster separation metrics while drastically lowering computational costs. The framework also supports automated cell-type annotation through Cell Ontology term enrichment, bridging the interpretability gap in chromatin accessibility data.

Personal highlights

Co-accessibility-driven module definition: REMO groups cis-regulatory elements not just by genomic proximity, but by correlated accessibility patterns across hundreds of cell types and physical chromatin contacts, capturing functionally related regulatory units.
Universal feature set for cross-dataset analysis: by providing a fixed set of ~340k modules for hg38, REMO eliminates dataset-specific peak calling, enabling direct integration and comparison of scATAC-seq datasets without reprocessing raw data.
Scalable quantification with fragtk: the authors introduce fragtk, a memory-efficient Rust-based tool for fragment counting that outperforms existing software in speed and scalability, especially when quantifying millions of regions across >1 million cells.
Automated annotation via Cell Ontology enrichment: each REMO module is pre-annotated with cell-type–specific accessibility profiles, enabling label-free cell-type prediction through enrichment testing, much like marker gene analysis in transcriptomics.
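
To make the idea of a fixed feature set concrete, here is a minimal sketch of counting fragments against a predefined module table, so that every dataset ends up with the same columns. It does not use fragtk or the real REMO files: the module names, coordinates, and the `count_fragments` helper are all hypothetical, and the rule that a fragment counts once per module it overlaps is an assumption made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical module table: module id -> member regions (chrom, start, end),
# half-open coordinates. Real modules would be loaded from a file instead.
MODULES = {
    "module_A": [("chr1", 100, 200), ("chr1", 5000, 5200)],
    "module_B": [("chr1", 300, 400)],
}

def build_index(modules):
    """Group member regions by chromosome and sort them by start."""
    index = defaultdict(list)
    for module_id, regions in modules.items():
        for chrom, start, end in regions:
            index[chrom].append((start, end, module_id))
    return {chrom: sorted(regs) for chrom, regs in index.items()}

def count_fragments(fragments, modules):
    """Count fragments per (cell, module).

    fragments: iterable of (chrom, start, end, cell_barcode).
    A fragment adds 1 to a module if it overlaps any member region,
    and at most 1 even if it overlaps several regions of that module.
    """
    index = build_index(modules)
    starts = {chrom: [r[0] for r in regs] for chrom, regs in index.items()}
    counts = {}
    for chrom, f_start, f_end, cell in fragments:
        regs = index.get(chrom, [])
        # Only regions that start before the fragment ends can overlap it.
        hi = bisect_right(starts.get(chrom, []), f_end - 1)
        hit_modules = {m for s, e, m in regs[:hi] if e > f_start}
        for m in hit_modules:
            cell_counts = counts.setdefault(cell, {})
            cell_counts[m] = cell_counts.get(m, 0) + 1
    return counts

# Toy fragments: (chrom, start, end, cell barcode).
fragments = [
    ("chr1", 150, 350, "cellA"),    # overlaps module_A and module_B
    ("chr1", 5100, 5150, "cellA"),  # overlaps the second region of module_A
    ("chr1", 390, 450, "cellB"),    # overlaps module_B
    ("chr2", 100, 200, "cellB"),    # no module on chr2
]
counts = count_fragments(fragments, MODULES)
```

Because the module table is fixed in advance, two datasets processed this way share the same feature axis and can be stacked without re-calling peaks.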

Why should we care?

REMO solves a foundational bottleneck in single-cell epigenomics: the absence of a common coordinate system for chromatin accessibility. Without it, every scATAC-seq study operates in isolation—re-inventing features, hindering integration, and bloating computational costs. By providing a universal, biologically informed feature set, REMO enables true atlas-scale epigenomics, where datasets across labs, conditions, and time can be combined, compared, and reinterpreted.

Haruka: A Spatial Contrastive Learning Framework to Decipher Perturbation Responses in Tissue Niches

Cui et al. bioRxiv (2025). https://doi.org/10.64898/2025.12.08.693051

The paper in one sentence

Haruka is a spatially aware contrastive learning framework that disentangles condition-specific changes from conserved tissue architecture in spatial omics data, enabling precise mapping of perturbation responses across diverse biological and clinical contexts.

Summary

Haruka integrates contrastive variational inference with microenvironment reconstruction to learn two embeddings per cell: a salient (condition-specific) representation and a background (shared) representation. By explicitly modeling spatial context and contrasting conditions, Haruka identifies spatially coherent domains that capture how tissues remodel in response to perturbations, such as disease, treatment, or aging. The framework outperforms existing methods in detecting heterogeneous spatial responses and has been successfully applied to melanoma immunotherapy, lung fibrosis, and KRAS-mutant lung cancer, revealing microenvironmental mechanisms of treatment response and resistance.

Personal highlights

Spatially aware contrastive disentanglement: Haruka uniquely combines contrastive variational inference with spatial niche reconstruction, learning separate latent embeddings for perturbation-specific (salient) and conserved architectural (background) features, enabling clear separation of signal from context.
Robust detection of fragmented and discontinuous domains: unlike methods that penalize spatial fragmentation, Haruka captures biologically meaningful salient domains even when they are spatially discontinuous, as demonstrated in real-world aging brain datasets.
Microenvironmental decoding of immunotherapy response: applied to melanoma CODEX data, Haruka identified four recurrent immune niches that stratify responders vs. non-responders, revealing that effective response can arise from either T-cell-dense hubs or stromal-integrated immune harmony.
Cross-condition alignment and transition mapping: in lung fibrosis, Haruka’s background embeddings improved cross-tissue alignment using SLAT, enabling precise tracking of fibroblast transitions from alveolar to activated states and revealing early transcriptional priming before morphological changes.
Intra-condition heterogeneity in drug response: in KRAS-inhibitor-treated lung cancer, Haruka uncovered spatially distinct resistant niches characterized by sustained MAPK signaling, coordinated immunosuppressive networks, and macrophage differentiation trajectories.
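
The salient-versus-background split can be illustrated with a much simpler linear stand-in. The sketch below is not Haruka's model: it swaps in contrastive PCA, ignores spatial context entirely, and runs on simulated data. It only shows why contrasting a perturbed condition against a control can surface a condition-specific axis that ordinary PCA would rank second. The function name and the `alpha` weight are assumptions for illustration.

```python
import numpy as np

def contrastive_directions(target, background, alpha=1.0):
    """Eigenvectors of cov(target) - alpha * cov(background), sorted by
    decreasing eigenvalue. With alpha = 0 this reduces to ordinary PCA
    on the target data."""
    cov_target = np.cov(target, rowvar=False)
    cov_background = np.cov(background, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_target - alpha * cov_background)
    order = np.argsort(evals)[::-1]
    return evecs[:, order]

rng = np.random.default_rng(0)
n_cells = 2000
# Feature 0: strong variation shared by both conditions ("background").
# Feature 1: variation that only appears after perturbation ("salient").
control = rng.normal(size=(n_cells, 4)) * np.array([2.0, 0.3, 0.3, 0.3])
treated = rng.normal(size=(n_cells, 4)) * np.array([2.0, 1.5, 0.3, 0.3])

salient_axis = contrastive_directions(treated, control, alpha=1.0)[:, 0]
pca_axis = contrastive_directions(treated, control, alpha=0.0)[:, 0]
```

In this toy setting `pca_axis` loads mostly on the shared feature 0, while `salient_axis` loads mostly on the perturbation-specific feature 1, which is the behavior the two-embedding design is after.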

Why should we care?

Haruka shifts the focus in spatial omics from where genes or proteins are expressed to how spatial niches change under perturbation. By explicitly separating condition-specific effects from stable tissue architecture, it allows researchers to pinpoint microenvironmental drivers of treatment resistance, disease progression, and therapeutic success.

SpaceTrooper: A Data-Driven Quality Control Framework for Imaging-Based Spatial Omics

Banzi et al. bioRxiv (2025). https://doi.org/10.64898/2025.12.24.696336

The paper in one sentence

SpaceTrooper is a data-driven quality control framework that integrates morphological and expression features into a single per-cell quality score, enabling systematic identification and removal of low-quality cells in imaging-based spatial omics data without relying on arbitrary thresholds.

Summary

As imaging-based spatial omics technologies scale, the lack of standardized quality control (QC) has become a major bottleneck. Current methods often borrow fixed thresholds from dissociated single-cell workflows, failing to capture complex, spatially-derived artifacts like segmentation errors, tissue necrosis, out-of-focus imaging, and border distortions. SpaceTrooper addresses this by computing a unified Quality Score (QS) for each cell, integrating cell size, signal density, background noise, and (where relevant) border effects into a regularized logistic regression model. The framework is platform-agnostic, working seamlessly with CosMx, Xenium, and MERFISH data across RNA and protein modalities.

Personal highlights

Data-driven integration of multi-modal QC signals: instead of applying fixed thresholds to individual metrics, SpaceTrooper learns a unified Quality Score (QS) by integrating cell size, signal density, background noise, and spatial distortion features through a regularized logistic regression model, adapting to each dataset’s specific noise profile.
Detection of spatially coherent artifact regions: the framework identifies not only isolated low-quality cells but also spatially contiguous zones of poor quality, such as necrotic cores, out-of-focus areas, FOV border truncations, and tissue detachment, that reflect localized experimental failures and are often missed by single-metric filters.
Platform- and modality-agnostic design: SpaceTrooper operates seamlessly across major imaging platforms (CosMx, Xenium, MERFISH) and both RNA and protein modalities, automatically excluding irrelevant components (e.g., border effects for Xenium/MERFISH) and maintaining stable performance regardless of gene panel size or tissue type.
Superiority over conventional fixed-threshold filtering: compared to standard combined filtering (e.g., total counts, control probe ratio, cell area), SpaceTrooper’s QS-based approach removed a broader and more relevant set of low-quality cells, better resolved artifactual mixing, and improved cluster homogeneity without introducing cell-type bias.
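
As a rough illustration of how several QC metrics can be folded into one score, here is a minimal sketch of an L2-regularized logistic regression fitted by gradient descent. It is not the SpaceTrooper implementation: the three feature columns, the simulated values, and the way "good" and "bad" training cells are labeled are all assumptions, since this digest does not describe how the tool derives its training labels.

```python
import numpy as np

def fit_quality_model(X, y, l2=0.1, lr=0.1, n_iter=2000):
    """L2-regularized logistic regression via plain gradient descent.

    X: (n_cells, n_features) standardized QC metrics.
    y: 1 for cells assumed to be good, 0 for cells assumed to be bad.
    """
    n_cells, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n_cells + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def quality_score(X, w, b):
    """Per-cell score in [0, 1]; higher means more likely a good cell."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Simulated, already standardized metrics (hypothetical columns):
# [log cell area, signal density, background ratio].
rng = np.random.default_rng(0)
good = rng.normal(loc=[0.5, 0.5, -0.5], scale=0.5, size=(100, 3))
bad = rng.normal(loc=[-1.5, -1.5, 1.5], scale=0.5, size=(30, 3))
X = np.vstack([good, bad])
y = np.r_[np.ones(100), np.zeros(30)]

w, b = fit_quality_model(X, y)
qs = quality_score(X, w, b)
```

The practical appeal is that a single continuous score replaces three separate cutoffs, and the fitted weights show how much each metric contributes for a given dataset.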

Why should we care?

Quality control is the foundation of any robust spatial omics analysis, yet current approaches are often subjective, platform-specific, and blind to spatially structured artifacts. SpaceTrooper shifts QC from a manual, threshold-based chore to a reproducible, data-driven workflow. By integrating multiple sources of technical variation into a single, interpretable score, it ensures that low-quality cells, whether from segmentation errors, tissue damage, or imaging artifacts, are systematically removed before they distort clustering, trajectory inference, or cell-cell communication analysis.

SpacePhenotyper: Connecting spatial regions to clinical phenotypes by transferring knowledge from bulk patient data

Amgalan et al. 2025. bioRxiv. doi:10.64898/2025.12.12.693322

The paper in one sentence

SpacePhenotyper is a computational method that transfers clinically relevant phenotype predictions, like hazard and drug response, from bulk patient transcriptomics to spatially resolved transcriptomics data, revealing how tumor heterogeneity influences clinical outcomes at the spatial level.

Summary

SpacePhenotyper addresses a major gap in spatial transcriptomics: while current methods can identify cell types or molecular features in tissue sections, none can directly assign clinical phenotypes, such as patient survival risk or treatment response, to spatial regions. The authors introduce a novel spectral transfer-learning approach that first learns a phenotype predictor from bulk RNA-seq data (the Eigen-Gene), transforms it into a gene-weight vector (the Eigen-Patient), then uses cosine similarity to map relative phenotype quantities onto each spatial spot in SRT data. The method is validated on simulated data, breast cancer SRT datasets, and pathologist-annotated H&E images, showing high accuracy in distinguishing invasive vs. in situ regions and revealing intra-tumor heterogeneity in residual cancer burden and drug response.

Personal highlights

Knowledge transfer from bulk to spatial domains via spectral learning: SpacePhenotyper uses singular value decomposition to extract principal directions from bulk expression data, constructs an Eigen-Gene predictor optimized for clinical phenotypes, and projects it into gene space as an Eigen-Patient, a reference vector that generalizes the concept of marker genes to a weighted signature of phenotype relevance.
Cosine similarity as a robust phenotype quantifier: instead of relying on per-gene correlation thresholds, the method scores each spatial spot by cosine similarity between its expression profile and the Eigen-Patient, ensuring genes most predictive of the phenotype contribute most to the score, improving robustness to noise and cross-dataset variation.
Validation against pathologist annotations in breast cancer: using H&E-stained images annotated for ductal carcinoma in situ (DCIS), microinvasion, and invasive ductal carcinoma (IDC), SpacePhenotyper accurately assigns higher hazard scores to IDC regions and lower scores to DCIS, aligning with clinical expectations and providing spatial phenotype maps consistent with histopathology.
Uncovering spatial heterogeneity in treatment response: applied to residual cancer burden (RCB) after neoadjuvant chemotherapy, SpacePhenotyper reveals that RCB correlates positively with hazard and tumor purity but negatively with immune activity, and that these associations vary between in situ and invasive regions, highlighting the importance of spatially aware clinical phenotyping.
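
The bulk-to-spatial transfer can be sketched in a few lines of NumPy. This is a simplified reading of the steps described above, not the published algorithm: an ordinary least-squares fit in the top-k singular space stands in for the paper's phenotype-optimized predictor, the data are simulated, and the function names are hypothetical.

```python
import numpy as np

def gene_weights_from_bulk(bulk, phenotype, k=1):
    """Learn a gene-weight vector from bulk data.

    bulk: (n_patients, n_genes) expression; phenotype: (n_patients,).
    Step 1: SVD of the centered bulk matrix.
    Step 2: least-squares fit of the phenotype on the top-k patient scores.
    Step 3: project the fitted coefficients back into gene space.
    """
    gene_means = bulk.mean(axis=0)
    U, S, Vt = np.linalg.svd(bulk - gene_means, full_matrices=False)
    scores = U[:, :k] * S[:k]
    beta, *_ = np.linalg.lstsq(scores, phenotype - phenotype.mean(), rcond=None)
    return Vt[:k].T @ beta, gene_means

def cosine_scores(spots, weights, gene_means):
    """Cosine similarity between each centered spot profile and the weights."""
    centered = spots - gene_means
    norms = np.linalg.norm(centered, axis=1) * np.linalg.norm(weights)
    return (centered @ weights) / np.where(norms == 0, 1.0, norms)

# Simulated bulk cohort: one latent factor drives both expression and hazard.
rng = np.random.default_rng(0)
loadings = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
latent = rng.normal(size=40)
bulk = np.outer(latent, loadings) + 0.1 * rng.normal(size=(40, 5))
hazard = latent + 0.1 * rng.normal(size=40)

weights, gene_means = gene_weights_from_bulk(bulk, hazard, k=1)

# Two toy spots: one shifted along the high-hazard program, one against it.
spots = np.vstack([gene_means + 2 * loadings, gene_means - 2 * loadings])
spot_scores = cosine_scores(spots, weights, gene_means)
```

Cosine similarity only compares directions, so a spot with low total counts can still score high if its relative expression pattern matches the learned signature.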

scPT-seq: Direct detection of CRISPR mutations and transcriptional responses at single-cell resolution in vivo

Hawkins et al. 2025. bioRxiv. doi:10.64898/2025.12.23.696319

The paper in one sentence

scPT-seq is a single-cell RNA assay that directly sequences CRISPR-induced mutations at base-pair resolution alongside whole-transcriptome profiles from the same cell, enabling haplotype-resolved genotype–phenotype mapping, clonal tracing, and spatial analysis in complex tissues in vivo.

Summary

scPT-seq (Single-Cell Perturbation and Transcriptome sequencing) integrates droplet-based scRNA-seq with targeted long-read sequencing of edited loci from the same cells. After cDNA synthesis, nested PCR enriches CRISPR-targeted regions, which are sequenced on PacBio HiFi to recover full haplotypes, mutation spectra (indels, substitutions, splice variants), and cell barcodes. A custom computational pipeline phases reads, calls mutations per chromosome, and links genotypes to expression profiles. Applied to the Drosophila midgut, scPT-seq revealed diverse editing outcomes, distinguished cell-autonomous from environmental transcriptional changes, enabled dosage-response analysis across cell types, and used unique mutation combinations as heritable clonal barcodes to trace lineages and map spatial identities of intestinal stem cells. The method overcomes key limitations of guide-based or computational perturbation inference, providing direct, base-level genotype–phenotype linkage in complex tissues.

Personal highlights

Haplotype-resolved mutation detection via targeted long-read sequencing: scPT-seq couples standard scRNA-seq with PacBio HiFi sequencing of enriched CRISPR-targeted amplicons, enabling phasing of maternal and paternal alleles, detection of complex edits (indels, splice variants), and disambiguation of true monoallelic edits from allelic dropout—critical for accurate genotype assignment in single cells.
Direct genotyping overcomes confounding in differential expression analysis: by identifying wild-type cells within perturbed tissues, scPT-seq separates environmentally driven transcriptional changes (e.g., stress responses) from mutation-specific effects, revealing that bulk perturbed-vs-control comparisons can obscure true biological signals.
Revealing cell-type and spatial heterogeneity in editing outcomes and dosage responses: in the Drosophila gut, scPT-seq showed that editing efficiency and mutation spectra vary by cell type and region, and uncovered spatially distinct compensatory transcriptional programs (e.g., proteasome upregulation) in response to monoallelic vs. biallelic Prosα3 loss.
Lineage tracing via unique mutation combinations as clonal barcodes: editing outcomes serve as heritable markers, allowing reconstruction of clonal relationships and spatial registration of transcriptionally similar progenitor cells, enabling, for the first time, mapping of anterior vs. posterior intestinal stem cell identities and their region-specific responses to perturbation.
Detection of splicing alterations and structural predictions from RNA-level edits: scPT-seq identifies CRISPR-induced splice-junction changes and intron retention directly from RNA, and couples prevalent mutations (e.g., an in-frame 12 bp deletion in Prosα3) with AlphaFold2 structural modeling to propose mechanistic hypotheses about functional consequences.
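
The genotype-to-cell linkage can be sketched as a small bookkeeping step. The code below is not the authors' pipeline: the barcodes, haplotype labels, edit names, and the majority-vote consensus rule are hypothetical, and it assumes reads have already been phased. It shows how a cell with only one observed haplotype is kept apart from a true monoallelic edit.

```python
from collections import Counter, defaultdict

def consensus_genotypes(allele_reads):
    """Majority-vote edit call per (cell, haplotype).

    allele_reads: iterable of (cell_barcode, haplotype, edit),
    where edit is None for a wild-type read.
    """
    votes = defaultdict(lambda: defaultdict(Counter))
    for cell, haplotype, edit in allele_reads:
        votes[cell][haplotype][edit] += 1
    return {
        cell: {hap: calls.most_common(1)[0][0] for hap, calls in haps.items()}
        for cell, haps in votes.items()
    }

def classify(genotype, haplotypes=("hap1", "hap2")):
    """Zygosity class; 'ambiguous' when a haplotype was never observed,
    which could reflect allelic dropout rather than a real genotype."""
    if any(hap not in genotype for hap in haplotypes):
        return "ambiguous"
    n_edited = sum(genotype[hap] is not None for hap in haplotypes)
    return ("wild-type", "monoallelic", "biallelic")[n_edited]

# Toy phased reads from the targeted locus.
reads = [
    ("AAAC", "hap1", "12bp_del"), ("AAAC", "hap1", "12bp_del"), ("AAAC", "hap2", None),
    ("CCGT", "hap1", "1bp_ins"), ("CCGT", "hap2", "12bp_del"),
    ("GGTA", "hap1", None), ("GGTA", "hap2", None),
    ("TTAG", "hap1", "12bp_del"),  # hap2 never seen
]
genotypes = consensus_genotypes(reads)
classes = {cell: classify(g) for cell, g in genotypes.items()}

# Link genotype classes to a toy expression value per cell.
expression = {"AAAC": 4.0, "CCGT": 9.0, "GGTA": 2.0, "TTAG": 5.0}
by_class = defaultdict(list)
for cell, label in classes.items():
    by_class[label].append(expression[cell])
mean_expression = {label: sum(v) / len(v) for label, v in by_class.items()}
```

The per-cell tuple of edits across both haplotypes could also serve as a simple clonal key, since cells sharing the same rare combination are likely related.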

Why should we care?

scPT-seq closes a major gap in functional genomics: it directly links exact CRISPR edits to genome-wide transcriptional responses in individual cells within native tissue contexts. For researchers studying development, cancer, or regeneration, this means no longer guessing whether a cell was actually edited or what the edit was: you can now read out the mutation and its molecular consequences simultaneously. By revealing hidden heterogeneity in editing outcomes, disentangling environmental from genetic effects, and enabling spatial lineage tracing, scPT-seq turns pooled in vivo CRISPR screens from a black box into a high-resolution, mechanistic discovery platform.

Revealing a coherent cell-state landscape across single-cell datasets with CONCORD

Zhu et al. Nature Biotechnology (2026). https://doi.org/10.1038/s41587-025-02950-z

The paper in one sentence

CONCORD is a unified, self-supervised contrastive learning framework that simultaneously integrates batches, denoises data, and preserves complex biological structures, such as lineages, loops, and trajectories, across single-cell datasets, using only a minimalist neural network and principled minibatch sampling.

Summary

CONCORD transforms contrastive learning for single-cell genomics by redesigning how minibatches are sampled. Instead of uniform random sampling, it uses a probabilistic sampler that combines dataset-aware sampling (to isolate biological from technical variation) and hard-negative sampling (to enhance resolution of closely related states). This approach allows CONCORD to perform dimensionality reduction, batch correction, and denoising within a single model, without deep architectures or auxiliary losses. It robustly captures diverse topological features in simulations and real data, scales to million-cell atlases, and generalizes across technologies and species.

Personal highlights

From a limitation to a strength: CONCORD reframes contrastive learning’s sensitivity to minibatch composition as a design feature, using dataset-aware sampling to restrict contrasts within batches, effectively preventing the model from learning technical artifacts while preserving biological variation.
Hard negatives for high-resolution landscapes: by enriching minibatches with hard-negative samples (closely related cells), the model is forced to learn subtle distinctions between neighboring states, dramatically improving the resolution of fine-grained trajectories and rare subtypes.
Minimalist architecture, maximal performance: using only a single-hidden-layer neural network, CONCORD outperforms state-of-the-art methods, proving that sampling strategy alone, not model complexity, is often the key to learning coherent, denoised representations.
Topological faithfulness in latent spaces: CONCORD uniquely preserves global topological structures (like loops and branching trees) and local geometric neighborhoods, validated through persistent homology and trustworthiness metrics, critical for accurately mapping development and disease progression.
Interpretable, context-aware latent dimensions: the framework produces a dense, interpretable latent space where each dimension encapsulates context-specific gene coexpression programs, enabling gradient-based attribution to uncover biological mechanisms at single-cell resolution.
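
The two sampling ideas can be sketched without any neural network. The sampler below is a toy version, not CONCORD's code: the parameter names `p_intra` and `p_knn`, the uniform choice of dataset, and the brute-force neighbor search are assumptions for illustration. With probability `p_intra` the minibatch is drawn from a single dataset, and a fraction `p_knn` of it is filled with near neighbors of a seed cell to act as hard negatives.

```python
import numpy as np

def sample_minibatch(dataset, knn_idx, batch_size, p_intra=0.9, p_knn=0.3, rng=None):
    """Return cell indices for one minibatch.

    dataset: (n_cells,) dataset label per cell.
    knn_idx: (n_cells, k) precomputed nearest-neighbor indices.
    """
    rng = rng or np.random.default_rng()
    # Dataset-aware step: usually restrict the pool to one dataset so that
    # contrasts inside the minibatch are not driven by batch differences.
    if rng.random() < p_intra:
        chosen = rng.choice(np.unique(dataset))
        pool = np.flatnonzero(dataset == chosen)
    else:
        pool = np.arange(len(dataset))
    # Hard-negative step: add near neighbors of a seed cell from the pool.
    seed = rng.choice(pool)
    neighbors = np.intersect1d(knn_idx[seed], pool)
    n_hard = min(int(p_knn * batch_size), len(neighbors))
    if n_hard > 0:
        hard = rng.choice(neighbors, size=n_hard, replace=False)
    else:
        hard = np.array([], dtype=int)
    # Fill the rest of the minibatch uniformly from the remaining pool.
    remaining = np.setdiff1d(pool, np.append(hard, seed))
    n_rest = min(batch_size - n_hard - 1, len(remaining))
    rest = rng.choice(remaining, size=n_rest, replace=False)
    return np.concatenate(([seed], hard, rest)).astype(int)

# Toy data: 200 cells in 2 dimensions, two datasets of 100 cells each.
rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))
dataset = np.repeat(["batch1", "batch2"], 100)
sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
knn_idx = np.argsort(sq_dist, axis=1)[:, 1:11]  # 10 nearest, excluding self

minibatch = sample_minibatch(dataset, knn_idx, batch_size=16, p_intra=1.0, rng=rng)
```

With `p_intra=1.0` every minibatch comes from one dataset; lowering it lets a share of minibatches mix datasets, which is the knob that trades batch mixing against cross-dataset contrast.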

Why should we care?

By unifying batch correction, denoising, and dimensionality reduction into one principled sampling strategy, CONCORD provides a robust, scalable, and interpretable foundation for building coherent cell-state atlases across technologies, time points, and even species.

CellWHISPER: Inferring Direct Cell–Cell Communication from Spatial Transcriptomics

Kumar et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.07.697982

The paper in one sentence

CellWHISPER is a statistically rigorous, confounder-aware method that infers direct cell–cell communication, including gap junction and ligand–receptor signaling, from spatial transcriptomics data, while controlling for false positives and uncovering higher-order interaction patterns.

Summary

CellWHISPER addresses key limitations in existing tools for cell–cell communication (CCC) inference, which often suffer from high false-positive rates and focus narrowly on ligand–receptor interactions. By introducing a permutation-aware analytical null model, CellWHISPER conditions on spatial organization and gene expression to produce calibrated z-scores for CCC quadruplets (two cell types, two signaling genes). It also incorporates a latent variable model to distill recurring communication patterns and enables differential analysis across conditions. Applied to mouse brain data, CellWHISPER recapitulates known glial coupling, discovers novel excitatory neuron–astrocyte gap junction interactions, and identifies Alzheimer’s disease–specific shifts in microglial communication.

Personal highlights

Confounder-aware statistical framework for direct communication inference: CellWHISPER employs an analytical null model that conditions on cell-type–specific spatial organization and expression profiles, drastically reducing false positives compared to existing tools.
First large-scale mapping of the brain’s “connexin code”: using gap junction genes as signaling pairs, the method systematically uncovers glial and neuronal coupling patterns, including novel excitatory neuron–astrocyte interactions mediated by Cx36.
Latent variable model extracts interpretable communication modules: beyond quadruplet-level detection, the model infers higher-order patterns—cell-type similarity, signaling-gene preference, and condition-specific modules—without collapsing spatial or molecular context.
Differential analysis reveals disease-relevant communication rewiring: in an Alzheimer’s model, CellWHISPER identifies preserved astrocytic Cx43 networks and increased microglia-associated gap junction communication, highlighting region-specific neuroinflammatory signatures.
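
To show what a z-score for a quadruplet (two cell types, two genes) looks like in practice, here is a minimal sketch that uses an empirical permutation null. This is a deliberate simplification: CellWHISPER is described as using an analytical null, whereas this sketch shuffles expression within each cell type while keeping the spatial graph and cell-type labels fixed. The statistic (sum of sender times receiver expression over adjacent pairs), the function name, and the toy data are assumptions.

```python
import numpy as np

def quadruplet_zscore(edges, cell_type, g1, g2, type_a, type_b, n_perm=500, seed=0):
    """z-score for (type_a, type_b, gene 1, gene 2).

    edges: (n_edges, 2) indices of spatially adjacent cells (sender, receiver).
    g1, g2: per-cell expression of the two signaling genes.
    """
    rng = np.random.default_rng(seed)
    senders, receivers = edges[:, 0], edges[:, 1]
    keep = (cell_type[senders] == type_a) & (cell_type[receivers] == type_b)
    senders, receivers = senders[keep], receivers[keep]
    observed = np.sum(g1[senders] * g2[receivers])

    idx_a = np.flatnonzero(cell_type == type_a)
    idx_b = np.flatnonzero(cell_type == type_b)
    null = np.empty(n_perm)
    for p in range(n_perm):
        # Shuffle expression within each cell type; positions stay fixed.
        g1_perm, g2_perm = g1.copy(), g2.copy()
        g1_perm[idx_a] = rng.permutation(g1[idx_a])
        g2_perm[idx_b] = rng.permutation(g2[idx_b])
        null[p] = np.sum(g1_perm[senders] * g2_perm[receivers])
    return (observed - null.mean()) / null.std(ddof=1)

# Toy tissue: 50 adjacent astrocyte-neuron pairs with centered expression.
rng = np.random.default_rng(1)
n_pairs = 50
cell_type = np.tile(np.array(["astro", "neuron"]), n_pairs)
edges = np.column_stack([np.arange(0, 2 * n_pairs, 2), np.arange(1, 2 * n_pairs, 2)])

shared = rng.normal(size=n_pairs)
g1 = np.zeros(2 * n_pairs)
g2_coupled = np.zeros(2 * n_pairs)
g2_independent = np.zeros(2 * n_pairs)
g1[0::2] = shared + 0.2 * rng.normal(size=n_pairs)
g2_coupled[1::2] = shared + 0.2 * rng.normal(size=n_pairs)
g2_independent[1::2] = rng.normal(size=n_pairs)

z_coupled = quadruplet_zscore(edges, cell_type, g1, g2_coupled, "astro", "neuron")
z_independent = quadruplet_zscore(edges, cell_type, g1, g2_independent, "astro", "neuron")
```

Because the shuffle preserves each cell type's expression distribution and the adjacency graph, a high z-score here cannot be explained by abundant cell types or highly expressed genes alone, which is the kind of confounding the method is built to control.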

Thanks for reading.

Cheers,

Seb.

Sebcentrism
