Weekly reads 16/2/26
From ecDNA to epigenomic priming: modeling what averages miss
This week's papers share a common insight: some of the most important biological signals are encoded in variance, context, and hidden structure across scales, rather than in mean values. scAmp uses the stochastic inheritance of extrachromosomal DNA to detect oncogene amplifications from single-cell copy-number variation, resolving subclonal ecDNA heterogeneity and its phenotypic consequences directly in patient tumours. Svensson revisits a long-standing failure mode in SCVI, showing that low-UMI cells converge toward a learned bias point rather than undergoing classical posterior collapse, and demonstrates how self-supervised augmentation can rescue biological signals that would otherwise be lost. SPATIA views spatial biology as inherently hierarchical, combining morphology, gene expression, and tissue context to enable the controlled generation of microenvironment-dependent phenotypes. The scTumor Atlas prioritises representative malignant states over maximal aggregation, resulting in a useful, interpretable pan-cancer reference for comparing cell lines and predicting gene dependencies. scVital reframes cross-species integration by explicitly removing species signals while preserving conserved cancer cell states, revealing common treatment-resistant programs. Furthermore, OneCell CUT&Tag demonstrates that epigenomic priming can occur before transcriptional commitment, capturing multi-omic state transitions within a single cell.
Across these studies, a recurring theme emerges: when we model biology with the appropriate structure—variance-aware, depth-invariant, hierarchical, species-agnostic, or multi-layered—we discover programs that bulk averages and single-modality analyses consistently overlook.
Preprints/articles that I managed to read this week
scAmp: Analyzing focal gene amplifications at single-cell resolution
Jones, M. G et al. bioRxiv (2026). https://doi.org/10.64898/2026.02.14.705928
The paper in one sentence
scAmp is a probabilistic framework that detects extrachromosomal DNA (ecDNA) amplifications from single-cell copy-number data by leveraging the increased variance caused by random ecDNA inheritance, enabling analysis of subclonal heterogeneity and phenotypic consequences across patient tumors.
Summary
scAmp is an algorithm that detects ecDNA-amplified genes from single-cell copy-number data by exploiting a fundamental biological difference: ecDNAs lack centromeres and are inherited randomly during mitosis, generating greater copy-number variance in cell populations compared to stable chromosomal amplifications. The authors train a multi-layer perceptron on simulated copy-number distributions from a forward-time evolutionary model, featurizing each gene’s distribution across cells by its mean, variance, coefficient of variation, deciles, and interquartile range. scAmp achieves an average precision of 0.96 on simulated data and perfect agreement with previously characterized cell lines, outperforming a null model based on mean copy-number alone (AP 0.89). Crucially, scAmp corrects misclassifications from WGS: the breast cancer cell line BT474 was predicted by WGS to have ecDNA-amplified ERBB2, but scAmp correctly predicted chromosomal amplification, confirmed by metaphase FISH showing ERBB2 co-localized with DAPI-stained chromosomes. Applying scAmp to 73 patient tumors profiled with single-cell ATAC-seq through TCGA reveals ecDNA prevalence across cancer types (gliomas, lung, breast), identifies frequently amplified oncogenes (EGFR, MYC, KRAS), and enables phenotypic analysis. Tumors with ecDNA show shifts in immune composition: T cells are enriched in BRCA and LUAD, and macrophages in GBMx. Within tumors, ecDNA+ cancer cells exhibit upregulation of glycolysis and hypoxia pathways compared to ecDNA− cells from the same tumor. scAmp’s single-cell resolution reveals striking subclonal heterogeneity. In one GBM tumor, while ~80% of cells contain an MDM2-amplifying ecDNA, distinct subclones harbor additional ecDNAs amplifying MYC or CDK4, with corresponding changes in chromatin accessibility. Finally, scAmp generalizes to clinical FFPE samples analyzed by DNA FISH, correctly classifying ecDNA status in xenograft tumors and a tissue microarray of patient samples.
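To make the variance argument concrete, here is a toy sketch of my own, not the authors' forward-time evolutionary model. It assumes every cell divides each generation with no selection: ecDNA copies replicate and each copy lands in one daughter with probability 0.5, while a chromosomal amplification is split evenly. The function names `simulate_population` and `featurize` are hypothetical; the feature set mirrors the summary statistics listed above.

```python
import numpy as np

def simulate_population(n_generations, start_copies, ecdna, rng):
    """Toy lineage simulation: every cell divides once per generation."""
    cells = np.array([start_copies], dtype=np.int64)
    for _ in range(n_generations):
        doubled = 2 * cells  # copies replicate before division
        if ecdna:
            # Acentric ecDNA: each copy goes to daughter A with probability 0.5.
            to_a = rng.binomial(doubled, 0.5)
        else:
            # Chromosomal amplification: copies are split evenly.
            to_a = cells.copy()
        cells = np.concatenate([to_a, doubled - to_a])
    return cells

def featurize(copies):
    """Summary statistics of one gene's copy-number distribution across cells."""
    copies = np.asarray(copies, dtype=float)
    mean = copies.mean()
    var = copies.var()
    q25, q75 = np.percentile(copies, [25, 75])
    return {
        "mean": mean,
        "variance": var,
        "cv": np.sqrt(var) / mean if mean > 0 else 0.0,
        "deciles": np.percentile(copies, np.arange(10, 100, 10)),
        "iqr": q75 - q25,
    }

rng = np.random.default_rng(0)
ec = featurize(simulate_population(10, 20, ecdna=True, rng=rng))
chrom = featurize(simulate_population(10, 20, ecdna=False, rng=rng))
```

In this toy setting both populations keep the same mean copy number, so a mean-only model cannot separate them, while the variance, deciles, and IQR differ. That is the intuition behind featurizing the full distribution.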
Personal highlights
Variance-based ecDNA detection from single-cell copy-number data: scAmp leverages the non-Mendelian inheritance of ecDNA, random segregation during mitosis due to lack of centromeres, which generates greater copy-number variance across cell populations compared to stable chromosomal amplifications. This biological insight enables discrimination that bulk WGS cannot achieve, as demonstrated by the BT474 case where WGS misclassified ERBB2 while scAmp correctly predicted chromosomal integration.
Simulation-trained neural network outperforms mean-based models: the authors train a multi-layer perceptron on copy-number distributions generated from a forward-time evolutionary model of ecDNA and chromosomal amplification dynamics. By featurizing distributions with statistics beyond the mean (variance, deciles, IQR), scAmp maintains accuracy even for highly amplified chromosomal loci (copy-number >10) where mean-based models fail.
Subclonal ecDNA heterogeneity and phenotypic consequences: In a GBM tumor, scAmp resolves a dominant MDM2-ecDNA clone with two subclones acquiring additional MYC or CDK4 ecDNAs, revealing ongoing ecDNA diversification. Within tumors, ecDNA+ cancer cells show distinct chromatin accessibility and pathway activation (glycolysis, hypoxia) compared to ecDNA− cells from the same tumor, enabling functional dissection of ecDNA effects.
Clinical applicability to FFPE and DNA FISH: scAmp generalizes beyond single-cell genomics to clinically relevant modalities, correctly classifying ecDNA status from interphase DNA FISH data in xenograft tumors and a 14-sample tissue microarray of patient tumors with MYC amplifications, demonstrating potential for retrospective analysis of archival pathology specimens.
Why should we care?
Extrachromosomal DNA is not a rare curiosity: it appears in approximately 17% of primary tumors overall and is associated with significantly worse patient outcomes. Yet our understanding of ecDNA has been constrained by the limitations of bulk sequencing, which cannot resolve which cells within a tumor carry ecDNA, how ecDNA evolves over time, or what transcriptional consequences it confers. scAmp opens these questions by transforming single-cell copy-number data, already generated by assays like scATAC-seq, into a quantitative readout of ecDNA status.
Improving SCVI for low-count cells through self-supervised augmentation
Svensson, V. bioRxiv (2026). https://doi.org/10.64898/2026.02.11.705441
The paper in one sentence
By adding binomial thinning augmentation and a cross-correlation loss during training, SCVI can learn representations that preserve biological signal for low-UMI cells, which typically collapse to a learned bias point, enabling analysis of cells that would otherwise be discarded.
Summary
Single-cell RNA sequencing data suffers from variation in total molecule counts (library size) between cells, a major source of nuisance variation. SCVI, a count-based variational autoencoder, is designed to integrate out this variation, but cells with extremely low UMI counts still separate from high-UMI cells in learned representations and are typically filtered out before analysis. Svensson investigates the mechanism behind this failure by artificially reducing the UMI counts of high-UMI cells through binomial thinning and passing them through trained SCVI encoders across six representative datasets. The key finding: as UMI depth decreases, cells converge toward a learned bias point in the encoder’s latent space, a fixed point representing a cell with zero observed molecules. This convergence is distinct from classical posterior collapse driven by KL regularization; massively increasing the KL term produces a different failure mode (collapse to the origin), while the bias point can be far from the origin. To address this, Svensson modifies the training procedure in two ways: (1) binomial thinning augmentation, artificially subsampling counts during training to expose the model to low-depth cells, and (2) a cross-correlation loss between embeddings of original and thinned cells, encouraging the encoder to produce similar representations regardless of depth. This approach is inspired by self-supervised learning methods like Barlow Twins, which reduce redundancy in representations. Ablation experiments show that augmentation alone is insufficient and degrades performance even at high depths. The cross-correlation loss is necessary, but reconstruction loss is also essential: pure self-supervision without reconstruction loses biological signal. The optimal configuration (JointEmbed with w=100) preserves cluster membership and condition differences down to ~100 UMI depth, where standard SCVI fails completely (cluster accuracy 0.083 vs. 0.280 at 100 UMI; condition accuracy 0.383 vs. 0.440). These gains come without sacrificing reconstruction quality.
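Binomial thinning itself is easy to sketch. The version below is my own minimal illustration, assuming each observed molecule is kept independently with a per-cell probability chosen to hit a target depth. The paper may sample the thinning rate differently during training, and `binomial_thin` and `target_depth` are hypothetical names.

```python
import numpy as np

def binomial_thin(counts, target_depth, rng):
    """Subsample each cell's UMI counts to roughly `target_depth` molecules.

    Each observed molecule is kept independently with probability
    p = min(1, target_depth / total_counts_of_cell).
    """
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=1, keepdims=True)
    keep_prob = np.minimum(1.0, target_depth / np.maximum(totals, 1))
    return rng.binomial(counts, keep_prob)

rng = np.random.default_rng(0)
cells = rng.poisson(5.0, size=(4, 2000))  # hypothetical cells x genes count matrix
thinned = binomial_thin(cells, target_depth=100, rng=rng)
```

Because thinning only ever removes molecules, the thinned profile is a plausible shallow re-sequencing of the same cell, which is what makes it a natural augmentation for count data.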
Personal highlights
Identification of bias point convergence distinct from posterior collapse: by systematically thinning high-UMI cells and tracing their trajectories through trained encoders, Svensson reveals that low-UMI cells collapse to a learned bias point, not the origin, demonstrating that this failure mode is distinct from classical KL-driven posterior collapse. Massive KL weighting produces collapse to the origin, while the bias point can be arbitrarily far, clarifying a long-standing empirical observation in the field.
Binomial thinning as self-supervised augmentation: the training modification exposes the model to artificially subsampled versions of high-UMI cells during training, forcing the encoder to learn representations invariant to total count depth. This simple data augmentation strategy, binomial thinning of observed counts, is biologically grounded in the sampling process of scRNA-seq and requires no external annotations.
Cross-correlation loss preserves biological signal across depths: borrowing from self-supervised learning (Barlow Twins), the added loss term encourages the embeddings of original and thinned cells to be similar while reducing redundancy across dimensions. This prevents the encoder from learning depth-dependent features and maintains cluster structure and condition differences at low UMI depths where standard SCVI fails.
Ablation reveals necessity of both augmentation and reconstruction: augmentation alone degrades performance even at high depths, and pure self-supervision without reconstruction loses biological signal entirely. The optimal configuration requires the full combination: augmentation, cross-correlation loss, and reconstruction loss, demonstrating that representation learning and generative modeling are complementary rather than substitutable.
Practical extension of usable cell range without sacrificing quality: the modified model preserves cluster accuracy and condition differences down to ~100 UMI depth (vs. standard SCVI failing below ~1000 UMI) with minimal impact on reconstruction metrics. This extends the range of analyzable cells, enabling inclusion of low-quality cells from precious samples or cost-effective shallow sequencing.
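For readers unfamiliar with Barlow Twins, the cross-correlation idea can be sketched as follows. This is a generic NumPy version of that style of loss, not Svensson's implementation; the `off_diag_weight` default is an assumption, and it is not the same quantity as the w=100 weight mentioned in the summary.

```python
import numpy as np

def cross_correlation_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Barlow Twins-style loss between two batches of embeddings (cells x dims)."""
    # Standardize each latent dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / z_a.shape[0]  # dims x dims cross-correlation matrix
    diag = np.diag(c)
    on_diag = ((diag - 1.0) ** 2).sum()          # invariance: matched dims should correlate
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # redundancy reduction across dims
    return on_diag + off_diag_weight * off_diag
```

Here `z_a` would be the embeddings of the original cells and `z_b` the embeddings of their thinned copies. The loss is small when the two views agree dimension by dimension and large when depth changes the representation.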
Why should we care?
Single-cell genomics faces a persistent trade-off: to get good data, you need high-quality cells with many transcripts; to work with precious or difficult samples, you often get low-quality cells with few transcripts. Standard practice is to filter out the latter, discarding potentially valuable biological material because current computational tools cannot handle it. Svensson’s work shows that this trade-off is not inevitable. By understanding the precise mechanism by which SCVI fails on low-UMI cells (convergence to a learned bias point, not posterior collapse), he designs a targeted fix: train the model to be invariant to total count depth by showing it augmented versions of its own data and enforcing representation consistency. The result is a model that retains biological signal down to ~100 UMI—cells that would normally be thrown away.
SPATIA: Multimodal Generation and Prediction of Spatial Cell Phenotypes
Kong, Z et al. bioRxiv (2026). https://doi.org/10.64898/2026.02.18.706593
The paper in one sentence
SPATIA is a hierarchical multimodal model that integrates cell morphology, gene expression, and spatial context across scales, from individual cells to niches to whole tissues, to enable both predictive analysis and controllable generation of microenvironment-dependent cellular phenotypes.
Summary
Image-based spatial transcriptomics technologies provide matched measurements of cellular morphology and gene expression in intact tissue, but existing methods typically analyze these modalities in isolation, lack cell-level resolution, or cannot model how local spatial context shapes cellular phenotypes. Kong and colleagues introduce SPATIA, a unified framework that learns spatially aware representations by explicitly modeling biological structure across three nested scales: individual cells, local niches (256×256 px regions containing 10–30 cells), and whole-slide tissue context. At the cell level, SPATIA fuses image-derived morphological tokens and gene expression embeddings via cross-attention. At the niche level, a transformer aggregates neighboring cell embeddings with regional image patches to model local cell–cell interactions. At the tissue level, a global transformer captures long-range dependencies across the full slide. This hierarchical design enables SPATIA to learn representations that integrate intrinsic cell state with extrinsic spatial context. More importantly, SPATIA introduces a spatially conditioned generative framework for predicting morphological outcomes of perturbations without requiring paired pre–post data. The authors construct weak supervision pairs between control and perturbed cells using entropy-regularized optimal transport (OT) in gene expression space, constrained by lineage consistency and spatial proximity. To address noise in these weak matches, they propose a confidence-aware flow matching objective that reweights training trajectories based on OT coupling uncertainty. A morphology-profile alignment loss further ensures generated cells match the distribution of real target morphologies in CellProfiler feature space. 
Across 12 tasks spanning phenotype generation, cell annotation, clustering, gene imputation, and cross-modal prediction, SPATIA outperforms 18 existing models, achieving an 8% improvement in generative fidelity (FID/KID) and up to 3% gains in predictive benchmarks. Ablation studies confirm that each hierarchical level contributes meaningfully, and robustness analysis shows the model remains stable under moderate OT pairing errors (10–20% corruption).
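The weak-pairing step can be illustrated with a plain Sinkhorn solver. This is a sketch under my own assumptions: uniform marginals, a squared-Euclidean cost on hypothetical expression embeddings, no lineage or spatial constraints, and a confidence score defined as one minus the normalized entropy of each coupling row. The paper's exact uncertainty weighting may differ.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between two uniformly weighted point clouds."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    kernel = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def pairing_confidence(plan):
    """Per-source-cell confidence: 1 - normalized entropy of its coupling row."""
    rows = plan / plan.sum(axis=1, keepdims=True)
    entropy = -(rows * np.log(rows + 1e-12)).sum(axis=1)
    return 1.0 - entropy / np.log(plan.shape[1])

rng = np.random.default_rng(0)
control = rng.normal(size=(30, 5))                           # hypothetical control embeddings
perturbed = control[:20] + 0.05 * rng.normal(size=(20, 5))   # hypothetical perturbed embeddings
cost = ((control[:, None, :] - perturbed[None, :, :]) ** 2).sum(axis=-1)
cost /= cost.max()  # keep exp(-cost / reg) in a numerically safe range
plan = sinkhorn(cost)
conf = pairing_confidence(plan)
```

A control cell whose coupling mass is spread over many targets gets a confidence near 0, and one with a sharp match gets a value near 1. Such weights could then scale each pair's contribution to a flow matching loss.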
Personal highlights
Hierarchical multi-scale architecture from cells to tissue: SPATIA explicitly models biological organization across three nested levels, individual cells, local niches (256×256 px regions), and whole-slide tissue context, using cross-attention transformers at each scale. This design captures both fine-grained cellular features and the spatial dependencies that govern tissue function, enabling representations that integrate intrinsic cell state with extrinsic microenvironmental context.
Confidence-aware flow matching for perturbation modeling without paired data: to predict morphological outcomes of biological transitions, where paired pre–post observations are unavailable, SPATIA constructs weak supervision pairs via optimal transport in gene expression space, constrained by lineage and spatial proximity. A confidence-weighting scheme downweights uncertain OT matches during flow matching training, while a condition-contrastive regularization encourages the model to distinguish different transition types, enabling controllable generation without brittle one-to-one correspondences.
Morphology-profile alignment ensures biological fidelity: generated cell images are evaluated not only by perceptual metrics (FID/KID) but also by their alignment with real target distributions in CellProfiler feature space. A sliced Wasserstein distance loss explicitly enforces that generated morphologies match the statistical properties of true target cells, ensuring that improvements in visual realism translate to biological correctness.
MIST: A large-scale multi-platform spatial transcriptomics atlas: the authors assemble and curate MIST, a dataset of 25.9 million cell–gene pairs from 74 sources spanning 17 tissues, 60 donors, and four major platforms. This resource enables cross-platform benchmarking and provides a foundation for training models that generalize across technical and biological variation.
Unified performance across generative and predictive tasks: SPATIA achieves state-of-the-art results on both fronts: 8% improvement in generative fidelity over specialized models like GeneFlow and MorphDiff, while matching or exceeding task-specific models on cell annotation, clustering, biomarker prediction, and gene expression imputation. This demonstrates that a single model can support both exploratory simulation and quantitative downstream analysis without sacrificing either capability.
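The sliced Wasserstein distance used for morphology-profile alignment has a compact Monte Carlo form. The sketch below is a generic version that assumes equal-sized samples of feature vectors (for example CellProfiler features); it is not the authors' loss code, and `n_projections` is an arbitrary choice.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=64, rng=None):
    """Monte Carlo sliced 2-Wasserstein distance between equal-sized samples."""
    if x.shape != y.shape:
        raise ValueError("this sketch assumes x and y have the same shape")
    rng = np.random.default_rng() if rng is None else rng
    # Random unit directions in feature space, one per column.
    directions = rng.normal(size=(x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # In one dimension, optimal transport simply matches sorted samples.
    px = np.sort(x @ directions, axis=0)
    py = np.sort(y @ directions, axis=0)
    return float(np.sqrt(((px - py) ** 2).mean()))
```

Projecting to many random one-dimensional slices avoids solving a full OT problem in high dimensions, which is why this distance is convenient as a training loss.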
Why should we care?
Spatial transcriptomics has transformed our ability to see where genes are expressed in tissue, but connecting that molecular information to what cells actually look like, and predicting how they might change under disease or treatment, has remained out of reach. Existing models either ignore morphology, lose cell-level resolution, or cannot simulate perturbations. SPATIA bridges these gaps by treating spatial biology as it actually is: hierarchical, multimodal, and context-dependent. The confidence-aware flow matching framework provides a general recipe for learning perturbation models when paired data doesn’t exist, a common scenario in biology where destructive measurements prevent tracking the same cell over time. The morphology-profile alignment loss offers a way to ground generative models in biologically meaningful features rather than pixel-level statistics alone. And the MIST dataset itself will likely become a valuable community resource for training and benchmarking spatial models.
A Pan-Cancer Single-Cell Atlas to Evaluate Tumor Identity, Cell Line Concordance, and Dependency Mapping
Reveron-Thornton, R. F et al. bioRxiv (2026). https://doi.org/10.64898/2026.02.14.705396
The paper in one sentence
The scTumor Atlas is a curated, quality-controlled pan-cancer single-cell reference of 135,424 malignant cells from 499 samples across 36 cancer types that enables systematic evaluation of tumor identity, benchmarking of cancer cell line models, and inference of gene dependencies directly from single-cell transcriptional states.
Summary
Bulk RNA sequencing has enabled large-scale pan-cancer analyses but obscures cancer cell-specific programs due to admixture with nonmalignant cells. Single-cell RNA sequencing resolves this, yet existing atlases often prioritize maximal data aggregation over biological coherence, resulting in unwieldy resources with variable data quality and limited interpretability. The authors here take a fundamentally different approach: rather than maximizing cell count, they prioritize representative malignant transcriptional states. Starting from public scRNA-seq datasets, they apply uniform stringent quality control (cells with <5,000 UMIs or >10% mitochondrial transcripts excluded), doublet removal with Scrublet, and careful malignant cell annotation. To prevent any single dataset from dominating, they implement a two-step downsampling framework using Mahalanobis distance from the centroid in principal component space: first per sample (capping at 5,000 cells), then per cancer type (capping at 5,000 representative cells). After integration with Harmony and scANVI, the final scTumor Atlas contains 135,424 high-quality malignant cells from 499 samples spanning 36 adult and pediatric malignancies. The atlas preserves lineage-specific transcriptional programs, with epithelial, mesenchymal, hematologic, and neuroendocrine cancers forming coherent clusters. Pathway analysis recapitulates expected biology: oxidative phosphorylation enriched in lung squamous carcinoma but not ALL, androgen signaling in prostate cancer, estrogen signaling in breast cancer, KRAS signaling in pancreatic cancer, and EMT signatures in sarcomas. These patterns align with independent TCGA bulk RNA-seq data, validating that the selected malignant states reflect broader tumor biology. The authors then use the atlas to evaluate cancer cell line (CCL) fidelity. By projecting single-cell CCL profiles into the same latent space, they quantify transcriptional similarity between cell lines and primary tumors. 
This reveals substantial heterogeneity: some pancreatic lines (PK59, DANG) closely match primary PAAD centroids, while others (PANC1, SW1990) diverge significantly, providing a quantitative framework for model selection. Most importantly, the atlas enables single-cell resolution gene dependency prediction. The authors train ElasticNet regression models on DepMap CRISPR screen data using pseudobulked scRNA-seq from matched cell lines, then apply these models to scTumor Atlas cells to generate predicted gene effect scores (PGES). This recapitulates known lineage-specific dependencies (CDK4 in medulloblastoma, BRAF in melanoma) and identifies putative novel vulnerabilities (QRICH1 in breast cancer, TCF7L2 in gastrointestinal cancers). In a proof-of-concept application to a primary retroperitoneal leiomyosarcoma profiled in-house, the framework predicts dependencies including IGF1R, a target with prior clinical investigation in sarcoma.
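The downsampling idea can be sketched in a few lines. I am assuming here that "representative" means keeping the cells closest to the centroid in Mahalanobis distance. The summary does not spell out the selection rule, so treat `mahalanobis_downsample` as a hypothetical helper rather than the authors' procedure.

```python
import numpy as np

def mahalanobis_downsample(pcs, cap):
    """Return sorted indices of up to `cap` cells closest to the centroid in PC space."""
    pcs = np.asarray(pcs, dtype=float)
    if pcs.shape[0] <= cap:
        return np.arange(pcs.shape[0])
    centered = pcs - pcs.mean(axis=0)
    # Pseudo-inverse keeps this stable if some components are nearly collinear.
    cov_inv = np.linalg.pinv(np.cov(centered, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)  # squared distances
    return np.sort(np.argsort(d2)[:cap])
```

Running this once per sample and then once more per cancer type on the pooled survivors would give the two-step structure described above, with the cap preventing any single large dataset from dominating.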
Personal highlights
Representative-state sampling over maximal aggregation: unlike atlas efforts that prioritize cell count above all else, the authors use Mahalanobis distance-based downsampling to select up to 5,000 representative malignant cells per cancer type. This yields a compact (135k cells) yet biologically coherent reference that preserves lineage structure while remaining computationally lightweight and interpretable, trading exhaustive inclusion for practical utility.
Stringent quality control and standardized annotation: public datasets vary widely in depth and annotation quality. The authors apply uniform filters (≥5,000 UMIs, ≤10% mitochondrial reads), Scrublet doublet removal, and consistent malignant cell identification, either retaining original annotations or applying cancer-specific rules (e.g., CHGA expression for PNET, keratin scores for CESC). This rigor ensures the atlas reflects genuine malignant states, not technical artifacts or mislabeled cells.
Quantitative benchmarking of cancer cell line fidelity: by projecting single-cell CCL profiles into the same scANVI latent space, the authors compute normalized Euclidean distances between cell line centroids and primary tumor centroids. This provides a continuous, interpretable metric of model concordance, revealing that not all lines for a given cancer type are equally representative, and enabling rational selection of models for translational studies.
Single-cell resolution gene dependency prediction: adapting a framework originally developed for bulk RNA-seq, the authors train ElasticNet models on DepMap CRISPR screens using pseudobulked scCCL expression, then apply them to scTumor Atlas cells. This yields predicted gene effect scores at single-cell resolution, recapitulating known dependencies and nominating novel candidates, bridging high-throughput functional genomics with in vivo tumor heterogeneity.
Personalized dependency inference in a rare tumor: as a translational proof-of-concept, the authors profile a primary retroperitoneal leiomyosarcoma, integrate it into the atlas, and apply the dependency models. The predicted vulnerabilities include IGF1R, a target previously investigated in sarcoma clinical trials, demonstrating that this workflow can generate actionable hypotheses from a single patient sample, particularly valuable for rare cancers where large cohorts are unavailable.
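The cell line fidelity metric from the highlights above boils down to centroid distances in the shared latent space. A minimal sketch follows, assuming the normalization is a division by the largest distance (the summary does not say which normalization the authors use); `centroid_distances` and the toy labels are hypothetical.

```python
import numpy as np

def centroid_distances(tumor_latent, tumor_labels, line_latent):
    """Distance from one cell line's centroid to each tumor-type centroid, scaled to [0, 1]."""
    tumor_labels = np.asarray(tumor_labels)
    line_centroid = line_latent.mean(axis=0)
    dists = {}
    for tumor_type in np.unique(tumor_labels):
        centroid = tumor_latent[tumor_labels == tumor_type].mean(axis=0)
        dists[str(tumor_type)] = float(np.linalg.norm(centroid - line_centroid))
    max_d = max(dists.values())
    if max_d == 0:
        return dists
    return {t: d / max_d for t, d in dists.items()}

rng = np.random.default_rng(0)
tumor_latent = np.vstack([rng.normal(0.0, 1.0, size=(100, 8)),   # toy "PAAD" cells
                          rng.normal(5.0, 1.0, size=(100, 8))])  # toy "SKCM" cells
tumor_labels = ["PAAD"] * 100 + ["SKCM"] * 100
line_latent = rng.normal(0.2, 1.0, size=(50, 8))                 # toy pancreatic cell line
scores = centroid_distances(tumor_latent, tumor_labels, line_latent)
```

A cell line whose lowest score falls on its nominal cancer type would count as a faithful model under this reading, and one that sits closer to another type would be flagged.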
Why should we care?
Cancer research faces a persistent translation gap: we have massive functional genomics datasets from cell lines (DepMap) and massive transcriptional datasets from tumors (TCGA), but connecting them is fraught with difficulty. Bulk tumor profiles are contaminated by stromal and immune signals; cell lines drift in culture; and single-cell atlases have become so large and heterogeneous that they are difficult to use as practical references. The scTumor Atlas takes a different tack. By prioritizing representative malignant states over maximal cell count, it creates a resource that is actually usable, small enough to distribute and query, clean enough to trust, and rich enough to support meaningful comparisons. The Mahalanobis downsampling strategy is a methodological contribution in itself, offering a principled way to balance representation without sacrificing biological signal.
Deep-Learning Tool ScVital Enables Species-Agnostic Integration of Cancer Cell States
Rub, J. et al. Cancer Research (2026). https://doi.org/10.1158/0008-5472.CAN-24-4889
The paper in one sentence
ScVital is a variational autoencoder with adversarial training that embeds single-cell RNA-seq data from different species into a shared latent space, enabling identification of conserved cancer cell states across mouse models and human tumors.
Summary
Genetically engineered mouse models (GEMMs) are essential for cancer research, but cross-species differences limit their predictive value for human disease. Single-cell RNA sequencing captures tumor heterogeneity, yet current integration methods treat cross-species comparison as a batch correction problem, failing to handle species-specific genes and often losing biological signal. Rub and colleagues develop scVital, a deep-learning framework specifically designed for species-agnostic integration. The model combines a conditional variational autoencoder with an adversarially trained discriminator. The encoder maps gene expression into a latent space while the decoder reconstructs the original data. The discriminator attempts to predict species from the latent representation, and the autoencoder is trained to fool it, removing species-specific signal while preserving cellular identity. Crucially, the reconstruction loss is designed to handle species-specific genes: mouse genes do not affect human cell reconstruction and vice versa, allowing integration without forcing all genes into a common feature space. To evaluate integration quality without relying on heuristic post-integration clustering, the authors introduce Latent Space Similarity (LSS), a metric that computes pairwise cosine distances between pre-annotated cell types in the latent space and calculates the AUC-F1 of correct cell-type pairings. LSS is robust to class imbalance and avoids the variability of clustering-dependent metrics like adjusted Rand index. Benchmarked on normal tissues (muscle, lung, pancreas, liver, bladder), scVital performs comparably to Harmony and scVI but with faster runtime than the deep-learning alternative scDREAMER, and better preserves species-specific cell types that other methods incorrectly merge. 
Applied to pancreatic ductal adenocarcinoma (PDAC), scVital aligns classic and basal cell states across mouse models and 24 human patients, while the mouse-specific mesenchymal state remains separate, correctly reflecting biology. In lung adenocarcinoma (LUAD), scVital identifies shared AT2-like and high-plasticity cell states across species. Integration of healthy, injured, and malignant lung tissue reveals similarity between the LUAD high-plasticity state and a damage-associated transient progenitor in mice. Most strikingly, in undifferentiated pleomorphic sarcoma (UPS), a rare cancer with no prior knowledge of cross-species concordance, scVital integrates a KP GEMM with two patient-derived xenografts treated with doxorubicin. It uncovers a treatment-resistant cell state enriched for a hypoxia signature (SLC2A1/Glut1) that is conserved across species and expands with prolonged chemotherapy, validated by immunohistochemistry. This state would have been missed by separate analysis of each dataset followed by marker intersection.
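The species-aware reconstruction loss is the part I found easiest to picture in code. Below is a minimal sketch assuming a mean-squared-error reconstruction over the union of mouse and human genes, with a boolean mask marking which genes exist in each species. The actual scVital loss may use a different error function; `masked_reconstruction_loss` is a hypothetical name.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, species, gene_mask):
    """Mean squared error counting only genes that exist in each cell's species.

    x, x_hat  : cells x genes arrays over the union of all species' genes
    species   : integer species index per cell
    gene_mask : species x genes boolean array, True where the gene belongs to that species
    """
    mask = gene_mask[species].astype(float)  # cells x genes
    sq_err = (x - x_hat) ** 2 * mask         # errors on foreign genes are zeroed out
    return float((sq_err.sum(axis=1) / mask.sum(axis=1)).mean())
```

With this masking, whatever the decoder outputs for a mouse-only gene in a human cell has no effect on the loss, which matches the description that mouse genes do not affect human cell reconstruction and vice versa.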
Personal highlights
Species-agnostic latent space with adversarial species removal: scVital’s architecture, a VAE with an adversarially trained discriminator, explicitly removes species-specific signal from the latent representation while preserving cellular identity. The reconstruction loss is designed to handle species-specific genes independently, so mouse genes don’t interfere with human cell reconstruction and vice versa, enabling true cross-species integration without forcing all genes into a common feature space.
Latent Space Similarity (LSS): a clustering-free integration metric: Current evaluation metrics (ARI, FM) require clustering post-integration cells, a highly variable and heuristic step. LSS instead computes pairwise cosine distances between pre-annotated cell types in the latent space and calculates the AUC-F1 of correct pairings. It is robust to class imbalance, avoids clustering artifacts, and correctly scores integration quality even for rare cell types that other metrics mis-evaluate.
Preservation of species-specific cell types: In mouse-human muscle integration, other methods erroneously merge mouse neural/glial cells with human mature skeletal muscle. scVital and scDREAMER keep this mouse-specific cluster distinct, a difference reflected in LSS but not in ARI, demonstrating that LSS captures biologically meaningful distinctions that clustering-based metrics miss.
Identification of a conserved treatment-resistant hypoxia state in UPS: In a rare sarcoma with no prior knowledge of cross-species concordance, scVital integrates a GEMM and two PDXs treated with doxorubicin, revealing a shared cell state enriched for a hypoxia signature (SLC2A1/Glut1) that expands with prolonged treatment. Validated by IHC, this state would have been missed by separate analysis followed by marker intersection, demonstrating scVital’s power to uncover conserved biology masked by strong batch and species effects.
Linking mouse lung injury response to human LUAD plasticity: Integrating healthy lung, alveolar injury, and LUAD data reveals similarity between the mouse high-plasticity cancer cell state (HPCS) and a damage-associated transient progenitor state absent in healthy tissue, suggesting that cancer may co-opt regenerative programs and providing a functional hypothesis for the origin of this aggressive cell state.
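To get a feel for how a clustering-free metric like LSS might behave, here is one possible reading of the description above, not the published definition: take per-cell-type centroids from two species, compute cosine distances between all cross-species pairs, call a pair "matched" when its distance falls below a threshold, and average the F1 of those calls over a sweep of thresholds. The function names and the threshold grid are my own assumptions.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def lss_like_score(cent_a, types_a, cent_b, types_b, n_thresholds=101):
    """Normalized area under the F1-vs-threshold curve for cross-species type matching."""
    dist = cosine_distance_matrix(cent_a, cent_b)
    truth = np.asarray(types_a)[:, None] == np.asarray(types_b)[None, :]
    f1_values = []
    for threshold in np.linspace(0.0, 2.0, n_thresholds):
        pred = dist <= threshold
        tp = np.sum(pred & truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        f1_values.append(2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0)
    # Thresholds are evenly spaced, so the mean approximates the normalized area.
    return float(np.mean(f1_values))
```

Under this reading, a latent space where matching cell types sit together across species scores high, while one where a mouse-specific population is merged into the wrong human type is penalized without ever running a clustering step.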
Why should we care?
Mouse models are the workhorses of cancer research, but their track record for predicting human outcomes is sobering: less than 10% of animal studies advance to clinical trials, and fewer than 1 in 10 of those gain FDA approval. A major reason is that cross-species differences, both technical and biological, obscure which features of mouse tumors actually reflect human disease. ScVital addresses this by learning what is shared across species and what is specific. Rather than treating mouse and human as two batches to be forcibly merged, it explicitly removes species signal while preserving cellular identity. This lets us ask a fundamentally different question: not “do mouse models resemble human tumors?” but “which cell states are conserved, and which are species-specific?”
Matched single-cell chromatin, transcriptome, and surface marker profiling captures in vivo epigenomic reprogramming during basal-to-luminal transition in the mammary gland
Schwager, A. et al. bioRxiv (2026). https://doi.org/10.64898/2026.02.16.706078
The paper in one sentence
OneCell CUT&Tag is a low-input method that profiles histone modifications, full-length transcriptomes, and surface markers from the same single cell, revealing that basal mammary epithelial cells harbor epigenomic priming for luminal fate, undetectable at RNA or protein level, and that basal-to-luminal transdifferentiation proceeds via continuous epigenomic remodeling but a binary transcriptomic switch.
Summary
The authors develop OneCell CUT&Tag, a plate-based method that starts from individual cells (as few as one) and generates high-coverage histone modification profiles (H3K27me3, H3K4me1), full-length transcriptomes via FLASH-seq, and surface marker quantification from the same cell. Key innovations include: (i) optimized lysis buffer preserving both chromatin integrity and cytoplasmic mRNA; (ii) carboxylic beads for nuclei isolation enabling serial solution changes without loss; (iii) adaptation of FLASH-seq to limited cytoplasmic extracts. The method achieves median 26k unique DNA fragments/cell (0.77 FrIP) and 8k genes/cell in cell lines, outperforming droplet-based alternatives, and works on fresh or frozen tissues, including a triple-negative breast cancer tumor.
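For readers less familiar with the FrIP metric quoted above (the fraction of fragments that fall in peaks, often written FRiP), here is a minimal sketch of how such a number can be computed. It is an illustration only, not the authors' pipeline: the function names, the half-open interval convention, and the toy coordinates are all assumptions made for the example.

```python
from bisect import bisect_right


def merge_peaks(peaks):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def frip(fragments, peaks):
    """Fraction of fragments overlapping at least one peak.

    fragments: list of (chrom, start, end), half-open coordinates (assumed).
    peaks: dict mapping chrom -> list of (start, end), half-open (assumed).
    """
    merged = {chrom: merge_peaks(iv) for chrom, iv in peaks.items()}
    starts = {chrom: [s for s, _ in iv] for chrom, iv in merged.items()}
    in_peaks = 0
    for chrom, f_start, f_end in fragments:
        if chrom not in merged:
            continue
        # Index of the last merged peak that starts before the fragment ends.
        i = bisect_right(starts[chrom], f_end - 1) - 1
        # Merged peaks are disjoint, so only this peak can overlap from the left.
        if i >= 0 and merged[chrom][i][1] > f_start:
            in_peaks += 1
    return in_peaks / len(fragments) if fragments else 0.0


# Toy data, invented for illustration (not from the paper).
toy_peaks = {"chr1": [(100, 200), (150, 300)], "chr2": [(0, 50)]}
toy_fragments = [
    ("chr1", 90, 110),   # overlaps the merged chr1 peak
    ("chr1", 300, 350),  # starts exactly where the peak ends: no overlap
    ("chr1", 250, 260),  # inside the merged chr1 peak
    ("chr2", 60, 80),    # outside the chr2 peak
    ("chr3", 0, 10),     # chromosome without peaks
]
print(frip(toy_fragments, toy_peaks))
```

A value of 0.77 would mean that roughly three out of four unique fragments land inside called peaks, which is what makes the per-cell profiles usable without heavy aggregation.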
In the mammary gland, they profile 773 epithelial cells across basal and luminal lineages with matched H3K4me1, H3K27me3, RNA, and 14 surface markers. While cytometry and RNA annotations show near-perfect concordance (98%), a subset of basal cells (9%) exhibit luminal-like epigenomes, enriched for H3K4me1 at luminal genes and depleted of H3K27me3 repression, undetectable at RNA or protein level. This epigenomic priming, specific to basal cells, aligns with their known context-dependent multipotency upon lineage ablation or transplantation. To capture the transition in vivo, they transplant 10,000 basal cells into cleared fat pads and profile engrafted cells at 4.5 days. Descendants show continuous epigenomic progression from basal to luminal in H3K4me1 space, with intermediate cells absent in reference populations, while transcriptomes exhibit a binary switch. Transitioning cells upregulate proliferation and downregulate TNFα and p53 signaling, TNFα being a known restrictor of basal multipotency, and upregulate Axl, a stemness driver.
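The logic behind the two percentages above (agreement between annotation layers, and the share of basal cells with a luminal-like epigenome) can be sketched in a few lines. This is a toy illustration under stated assumptions: the per-cell labels are invented, the epigenomic label is assumed to come from some upstream clustering of the histone-mark profiles, and the helper names are hypothetical rather than anything from the paper.

```python
def concordance(labels_a, labels_b):
    """Fraction of cells that receive the same label from two annotation layers."""
    pairs = list(zip(labels_a, labels_b))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def primed_fraction(rna_labels, epi_labels, source="basal", target="luminal"):
    """Among cells called `source` by RNA, the fraction whose epigenome looks like `target`."""
    epi_of_source = [e for r, e in zip(rna_labels, epi_labels) if r == source]
    if not epi_of_source:
        return 0.0
    return sum(e == target for e in epi_of_source) / len(epi_of_source)


# Invented labels for 20 cells (not the paper's data).
rna = ["basal"] * 10 + ["luminal"] * 10
cyto = list(rna)
cyto[-1] = "basal"      # one cell where cytometry and RNA disagree
epi = list(rna)
epi[0] = "luminal"      # one RNA-basal cell with a luminal-like epigenome

print(concordance(rna, cyto))     # agreement between RNA and surface markers
print(primed_fraction(rna, epi))  # share of RNA-basal cells that look primed
```

The point of the comparison is that the second number is only measurable when all layers come from the same cell; with unmatched data, the primed cells would simply be counted as basal.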
Personal highlights
OneCell CUT&Tag: low-input matched multi-omics from the same cell: unlike existing methods requiring 10⁴–10⁵ starting cells, OneCell works from one cell upward, generating high-coverage histone modification profiles (median 26k fragments), full-length transcriptomes (8k genes), and surface marker data per cell. The method is adaptable to fresh or frozen tissues, including patient tumors, and automation increases throughput to 1,536 cells per run with improved coverage.
Epigenomic priming of basal cells for luminal fate: a subset of basal mammary epithelial cells (9%) displays luminal-like H3K4me1 and H3K27me3 landscapes at luminal genes, despite expressing basal markers at RNA and protein levels. This priming, undetectable without matched multi-omics, aligns with basal cells’ known capacity to regenerate luminal lineages upon ablation or transplantation, suggesting epigenomic “readiness” enables rapid fate activation.
Continuous epigenomic remodeling vs. binary transcriptomic switch during transdifferentiation: following basal cell transplantation, descendants at 4.5 days show progressive H3K4me1 remodeling from basal to luminal states, with intermediate cells absent in steady-state epithelium. In contrast, transcriptomes exhibit a sharp binary switch between basal and luminal identities. This reveals that epigenomic reprogramming precedes and potentially enables the transcriptional commitment.
MOFA disentangles omic-layer-specific contributions to cell identity: Joint factor analysis of RNA, H3K4me1, and H3K27me3 identifies factors capturing basal identity through combined modalities (e.g., factor 7: Acta2, Krt14 expression + Trp63/Trp73 motif accessibility) and others revealing epigenomic-only distinctions (factor 2: H3K4me1 at stemness-associated Zfx motifs, undetectable in RNA). This demonstrates how matched multi-omics resolves regulatory layers that single modalities miss.
Why should we care?
Cell identity is not encoded in a single molecular layer; it emerges from the interplay of surface phenotype, transcriptional programs, and the epigenetic landscapes that prime or restrict them. Yet most single-cell technologies capture only one layer, forcing us to infer regulatory relationships across cells rather than measure them within the same cell. OneCell CUT&Tag changes this. By delivering matched epigenome, transcriptome, and surface data from the same cell, starting from as few as one, it opens the door to studying rare populations where every cell counts: early embryos, stem cell niches, patient biopsies. The mammary gland findings illustrate the power: a subset of basal cells is epigenomically primed for luminal fate, invisible to standard RNA or protein profiling. This priming likely explains their context-dependent multipotency and may represent a general mechanism by which tissues balance stability with regenerative capacity.
Other papers that piqued my interest and were added to the purgatory of my “to read” pile
Thanks for reading.
Cheers,
Seb.


