Weekly reads 26/1/26

When simple representations outperform complex models in biology

Feb 01, 2026

The overall theme of this week’s reads: biological insight doesn’t always require more complexity; sometimes it comes from choosing the right abstraction. From cancer cell geometry to population-scale viral genomics, these studies show how carefully designed representations can cut through noise, reveal hidden structure, and guide actionable decisions. ORIGAMI demonstrates that just two physical measurements, cell surface area and volume, are enough to map cancer cell plasticity and rationally design combination therapies across tumor types. On the computational side, CellScope builds interpretable, hierarchical single-cell atlases by explicitly modeling manifold structure and noise, while a large-scale benchmark of perturbation prediction models shows that generalization depends more on biological context than model size. Several works confront technical confounders head-on: cellAdmix reveals how segmentation errors dominate spatial transcriptomics analyses and offers an unsupervised correction, and HARMONIC integrates histology to causally filter spurious cell–cell communication calls. Beyond cells and tissues, population-scale reanalysis of WGS data uncovers EBV DNAemia as a genetically regulated biomarker tied to antigen presentation, while NANITE reimagines genome editing delivery by turning cells into temporary CRISPR factories.

Preprints/articles that I managed to read this week

Simple Cell Geometry Metrics Reveal Cancer Cell Diversity and Plasticity and Guide Combination Therapy

De Blander et al. Preprint (2026). DOI: https://doi.org/10.21203/rs.3.rs-8278743/v1

The paper in one sentence

By measuring just two physical parameters, cell surface area and volume, researchers can track cancer cell states, predict therapy resistance, and design effective drug combinations that work across diverse cancer types.

Summary

This study introduces ORIGAMI, a scalable and inexpensive flow cytometry method that uses cell surface area (S) and volume (V) to map the diversity and plasticity of cancer cells. In melanoma, different transcriptomic states correspond to distinct S/V profiles, forming a “morphostate” landscape. Under therapy, cells shift predictably within this landscape, increasing surface area during dedifferentiation or volume during senescence-like states. Using ORIGAMI-guided screening, the team identified drugs that push cells toward more vulnerable regions: brefeldin A reduces surface area and promotes differentiation, while nigericin exploits lysosomal vulnerabilities in enlarged, therapy-resistant cells. The combination of these agents with standard targeted therapies (MEK and CDK4/6 inhibitors) showed potent activity in melanoma and other cancers, overcame immunotherapy resistance, and extended survival in preclinical models, all without relying on genetic markers.
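
To make the S/V idea concrete, here is a minimal sketch of one way to put surface area and volume on a common scale: compare a cell's measured surface area with that of a sphere of equal volume. This is a generic geometric normalization, not necessarily the metric used in the paper; the function names and the example numbers are hypothetical.

```python
import math

def sphere_surface_area(volume):
    """Surface area of a sphere with the given volume.

    A sphere has the smallest surface area of any shape enclosing that volume:
    S = (36 * pi)^(1/3) * V^(2/3).
    """
    return (36.0 * math.pi) ** (1.0 / 3.0) * volume ** (2.0 / 3.0)

def excess_surface_ratio(surface_area, volume):
    """Measured surface area divided by that of an equal-volume sphere.

    A value of 1 means a perfect sphere; larger values mean more membrane
    than a sphere of that volume would need (ruffles, protrusions, spreading).
    """
    return surface_area / sphere_surface_area(volume)

# Hypothetical cell: 2000 cubic micrometres of volume, 1200 square micrometres of surface.
ratio = excess_surface_ratio(surface_area=1200.0, volume=2000.0)
print(f"excess surface ratio: {ratio:.2f}")
```

Under this normalization, a shift toward higher surface area at constant volume raises the ratio, while pure volume expansion with a proportionally smaller surface gain lowers it, which is one simple way to read the two trajectories described above.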

Personal highlights

Geometry as a proxy for cell state: two simple physical metrics, surface area and volume, capture the transcriptional and phenotypic diversity of cancer cells, replacing costly multi-omics profiling with a rapid, scalable assay.
Dynamic tracking of therapy-induced plasticity: ORIGAMI reveals predictable “morphostate” trajectories under drug pressure, mapping how cells adapt by increasing surface (dedifferentiation) or volume (senescence-like states).
Lysosomal vulnerability in enlarged cells: therapy-induced cell volume expansion creates a dependency on lysosomal function, which can be targeted with the iron-sequestering compound nigericin to trigger ferroptosis.
Surface reduction restores drug sensitivity: ER–Golgi disruptors like brefeldin A shrink cell surface area, push mesenchymal-like cells toward a differentiated state, and re-sensitize them to targeted therapies.
Pan-cancer applicability: the S/V framework and derived combination therapy (TPNB: trametinib, palbociclib, nigericin, brefeldin A) show efficacy across melanoma, breast, lung, ovarian, and other cancers, independent of driver mutations.

Why should we care?

ORIGAMI reframes how we understand and target cancer: instead of chasing genetic mutations, it uses universal physical traits, cell size and shape, to monitor tumor heterogeneity and adaptation. This approach is faster, cheaper, and more accessible than genomic profiling, making it feasible for clinical tracking and personalized combination therapy. The derived drug strategy (TPNB) not only overcomes non-genetic resistance but also turns therapy-induced adaptations into vulnerabilities, effectively “steering” cells toward more treatable states.

CellScope: A High-Performance Cell Atlas Workflow with Tree-Structured Representation

Li et al. Nature Communications (2025). https://doi.org/10.5281/zenodo.17636503

The paper in one sentence

CellScope is a novel manifold-learning-based computational framework that constructs high-resolution, hierarchical cell atlases by intelligently filtering noise, performing multi-level clustering, and providing interpretable tree-structured visualizations, significantly outperforming existing tools in accuracy, speed, and biological insight.

Summary

CellScope addresses key limitations in single-cell RNA-seq analysis, such as biased gene selection, lack of hierarchical representation, and poor handling of technical noise, by introducing a manifold-based workflow. It uses a two-stage manifold fitting process to separate biological signal from housekeeping and technical noise, selects informative genes via high-density “manifold seeds,” performs graph-based agglomerative clustering to capture nested cell types, and generates tree-structured visualizations that integrate UMAP with hierarchical clustering. Validated across 36 diverse datasets, CellScope consistently outperforms Seurat, Scanpy, and newer methods in clustering accuracy, rare cell detection, computational efficiency, and interpretability, while enabling novel biological discoveries such as oligodendrocyte subtypes and COVID-19-specific immune signatures.

Personal highlights

Manifold-aware gene selection via density-distance seeds: CellScope identifies biologically informative genes by selecting high-density, well-separated “manifold seeds” and their reliable neighboring cliques, effectively distinguishing signal from housekeeping and technical noise without relying on parameter-heavy dispersion-based methods.
Two-stage manifold fitting for dual-noise removal: the framework explicitly models and removes two types of noise, housekeeping gene expression and technical dropout, by first filtering ubiquitous genes and then projecting low-density cells toward high-density submanifolds, preserving true biological structure.
Graph-based hierarchical clustering with tree-structured visualization: unlike flat clustering methods, CellScope builds a cell–cell similarity graph and performs agglomerative clustering to produce a multi-resolution hierarchy, visualized as an interactive tree that integrates UMAP layouts with dendrogram branching.
Dynamic molecular identity classification for genes: moving beyond binary marker/non-marker labels, CellScope classifies genes into three dynamic roles—housekeeping, moderately cell-type-related, and strongly cell-type-related—based on expression differences across clustering levels, revealing context-dependent gene functions.
Interpretable, parameter-light, and scalable design: CellScope requires minimal hyperparameter tuning, adapts to dataset size via adaptive distance metrics (Euclidean/Jaccard), and maintains high performance across both small and large (>250k cells) datasets while offering clear, biologically grounded explanations for its outputs.
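
The multi-resolution hierarchy mentioned in the highlights can be illustrated with a toy agglomerative clustering. This is only a sketch: it uses plain average linkage on Euclidean points as a stand-in, whereas CellScope works on a cell–cell similarity graph after manifold fitting, which is not reproduced here. The `agglomerate` and `cut` helpers and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def average_distance(points, a, b):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    """Return the partition at every level, from singletons down to one cluster."""
    clusters = [[i] for i in range(len(points))]
    levels = [[tuple(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(points, clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        levels.append([tuple(c) for c in clusters])
    return levels

def cut(levels, n_clusters):
    """Partition with exactly n_clusters groups, as a set of index tuples."""
    return set(levels[len(levels) - n_clusters])

# Toy "cells": two nearby subtypes (0-1 and 2-3) and one distant type (4-5).
cells = [(0, 0), (0, 1), (3, 0), (3, 1), (20, 0), (20, 1)]
levels = agglomerate(cells)
print("fine level:  ", sorted(cut(levels, 3)))
print("coarse level:", sorted(cut(levels, 2)))
```

The same tree yields both the fine partition (three subtypes) and the coarse one (two major types), which is the property a flat clustering at a single resolution cannot give you.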

Benchmarking algorithms for generalizable single‑cell perturbation response prediction

Wei et al. Nature Methods (2025). https://doi.org/10.1038/s41592-025-02980-0

The paper in one sentence

This study presents a comprehensive benchmark of 27 computational methods for predicting single‑cell perturbation responses, rigorously evaluating their generalizability across two key scenarios, cellular context generalization and perturbation generalization, and providing practical guidance for method selection based on dataset characteristics.

Summary

Single‑cell perturbation technologies like Perturb‑seq enable systematic investigation of gene functions, but large‑scale experimental screens remain costly and complex. Computational methods have emerged to predict perturbation effects, yet their true generalizability across unseen cellular contexts or new perturbations is often unclear. This paper systematically benchmarks 27 state‑of‑the‑art methods, including foundation models (e.g., scGPT, scFoundation) and simpler baselines, across 29 datasets using six complementary metrics (MSE, PCC‑delta, Wasserstein distance, etc.). The evaluation is structured around two core scenarios: (1) cellular context generalization, where models trained on certain cell lines/patients predict effects in new contexts; and (2) perturbation generalization, where models predict effects of unseen genetic or chemical perturbations. Key findings reveal that no single method performs best universally: trVAE, CellOT and inVAE excel in cellular‑context generalization; GenePert and scGPT lead in genetic‑perturbation prediction; chemCPA is strongest for chemical perturbations. Importantly, most methods struggle when test contexts differ substantially from training data, highlighting limited generalizability. The authors propose a cellular‑context embedding strategy to improve cross‑context prediction and provide decision guides to help researchers select tools based on their data type, size, and goal.
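
To see why a delta-based metric matters, here is a minimal sketch of MSE and PCC‑delta on a toy expression vector. It assumes the common definition of PCC‑delta, the Pearson correlation between predicted and observed expression changes relative to the control profile; the benchmark's exact implementation may differ, and all names and numbers below are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mse(observed, predicted):
    """Mean squared error across genes."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def pcc_delta(control, observed, predicted):
    """Pearson correlation of (observed - control) with (predicted - control)."""
    true_delta = [o - c for o, c in zip(observed, control)]
    pred_delta = [p - c for p, c in zip(predicted, control)]
    return pearson(true_delta, pred_delta)

# Toy mean expression for four genes (hypothetical values).
control = [5.0, 1.0, 3.0, 8.0]
observed = [6.0, 1.0, 1.0, 8.5]          # true post-perturbation profile
wrong_direction = [4.0, 1.0, 5.0, 7.5]   # predicts every change with the wrong sign

print("raw Pearson:", round(pearson(wrong_direction, observed), 3))
print("PCC-delta:  ", round(pcc_delta(control, observed, wrong_direction), 3))
```

The prediction with every change in the wrong direction still correlates positively with the observed profile on raw expression, because baseline expression dominates; only the delta-based score exposes it as anti-correlated with the true response.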

Personal highlights

No one‑size‑fits‑all method: across 29 diverse datasets, no single algorithm consistently outperformed others, underscoring that method choice must be tailored to the specific prediction scenario, whether generalizing across cell types, patients, species, or unseen perturbations.
Scenario‑specific leaders emerge: in cellular‑context generalization, trVAE, CellOT and inVAE delivered the best overall accuracy; for genetic single‑perturbation prediction, GenePert and scGPT excelled; for chemical perturbations, chemCPA topped the rankings, revealing that domain‑specialized approaches still outpace general‑purpose foundation models in many settings.
Generalization fails when contexts differ too much: models performed poorly when predicting effects in cellular contexts highly dissimilar to training data, exposing a critical limitation in current algorithms’ ability to capture inter‑context heterogeneity, a key hurdle for real‑world applications like cross‑patient or cross‑species prediction.
Cellular‑context embedding as a promising fix: to address poor cross‑context generalization, the authors propose integrating prior biological knowledge via cell‑line embeddings, demonstrating that such context‑aware representations can boost prediction accuracy, especially on metrics like PCC‑delta and Common‑DEGs.
Practical decision guides for tool selection: beyond rankings, the study offers clear, data‑driven guidance, such as using scVIDR for dose‑response data, linearModel/scouter for combinatorial genetic perturbations, and baseReg for chemical combinations, helping researchers navigate the crowded tool landscape with confidence.

Why should we care?

This benchmark cuts through the hype surrounding single‑cell perturbation prediction models, especially large foundation models, by rigorously testing their real‑world generalizability. For computational biologists, it provides an evidence‑based map to navigate method selection, saving time and resources while highlighting where current models fall short. For experimentalists, it underscores that prediction tools are not yet plug‑and‑play; understanding a model’s limits, such as its sensitivity to context differences or perturbation strength, is essential for designing interpretable in‑silico screens. More broadly, the study signals that true “generalizable” perturbation AI will require deeper integration of biological priors and context‑aware architectures, moving beyond purely data‑driven patterns.

Impact and correction of segmentation errors in spatial transcriptomics

Mitchel, J. et al. Nature Genetics (2026). https://doi.org/10.1038/s41588-025-02497-4

The paper in one sentence

Segmentation errors in imaging-based spatial transcriptomics lead to widespread molecular admixture between cells, which dominates downstream analyses, but a factorization-based method can detect and correct these artifacts without requiring matched single-cell data or membrane stains.

Summary

The study demonstrates that imperfect cell segmentation in high-resolution spatial transcriptomics data results in significant “molecular admixture,” where transcripts from neighboring cells are incorrectly assigned. This artifact strongly biases common analyses such as differential expression, cell–cell interaction inference, and ligand–receptor pairing. Using multiple tissues and platforms, the authors show that admixture signals often dominate top results. To address this, they introduce cellAdmix, a computational pipeline that applies non-negative matrix factorization (NMF) to subcellular neighborhood composition vectors followed by a conditional random field (CRF) to assign molecules to factors. A key innovation is the “cell-bridging” score, which automatically identifies admixture factors based on their spatial positioning near cell borders, without requiring matched scRNA-seq or membrane stain data. The method significantly reduces false-positive signals while preserving native expression patterns.

Personal highlights

Factorization of molecular neighborhoods for subcellular disentanglement: by applying weighted NMF to neighborhood composition vectors at subcellular resolution, the method extracts recurrent expression patterns, including both real cellular compartments and admixture clusters, without relying on prior cell-type labels.
Spatially aware factor assignment with conditional random fields: a CRF model incorporates spatial proximity and gene composition to assign each transcript to an NMF factor, enabling precise isolation of admixed molecules that are clustered near cell edges or across z-planes.
Unsupervised “cell-bridging” score for admixture detection: instead of depending on external data, the method scores factors based on whether their molecules are systematically minor in a target cell type and located closer to the border of a potential source cell, a purely geometric and expression-based criterion.
Bayesian admixture probability metric for benchmarking: a novel metric quantifies per-cell admixture levels by integrating spatial adjacency and scRNA-seq marker exclusivity, allowing objective comparison across segmentation methods and correction strategies.
Cross-platform and cross-tissue validation of correction impact: the pipeline consistently reduced artifactual signals in DE, GSEA, ligand–receptor inference, and multicellular program analysis across six diverse datasets from mouse hypothalamus to human ovarian cancer, highlighting its generalizability.
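
As a rough illustration of the input to the factorization step, the sketch below builds a neighborhood composition vector for each molecule: the gene fractions among its k nearest molecules. It is a simplified assumption of how such vectors could be computed (brute-force neighbors in 2D, the molecule itself excluded, no weighting), and it stops before the NMF and CRF steps. The function name, coordinates, and marker genes are illustrative only.

```python
import math
from collections import Counter

def neighborhood_composition(molecules, genes, k):
    """For each molecule, the fraction of each gene among its k nearest molecules.

    molecules: list of (x, y, gene) tuples; the molecule itself is excluded.
    Returns one vector per molecule, ordered like `genes`.
    """
    vectors = []
    for i, (xi, yi, _) in enumerate(molecules):
        others = [
            (math.hypot(xi - x, yi - y), gene)
            for j, (x, y, gene) in enumerate(molecules)
            if j != i
        ]
        nearest = sorted(others)[:k]
        counts = Counter(gene for _, gene in nearest)
        vectors.append([counts[g] / k for g in genes])
    return vectors

genes = ["Gfap", "Aqp4", "Snap25", "Syt1"]
molecules = [
    (0.0, 0.0, "Gfap"), (0.0, 1.0, "Gfap"), (1.0, 0.0, "Aqp4"),        # astrocyte-like patch
    (10.0, 10.0, "Snap25"), (10.0, 11.0, "Snap25"),
    (11.0, 10.0, "Syt1"), (11.0, 11.0, "Syt1"),                         # neuron-like patch
    (0.5, 0.5, "Snap25"),                                               # neuronal transcript in the astrocyte patch
]
vectors = neighborhood_composition(molecules, genes, k=3)
print("stray Snap25 neighborhood:", [round(v, 2) for v in vectors[7]])
```

The stray Snap25 molecule ends up with a neighborhood made entirely of astrocyte genes, the kind of mismatch between a molecule's identity and its surroundings that a factorization over these vectors could then pick up as a recurrent pattern.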

Why should we care?

Even with state-of-the-art segmentation, molecular admixture persists and can dominate results. cellAdmix provides a post-segmentation correction layer that makes downstream analyses, from differential expression to cell–cell communication, more reliable and biologically interpretable.

Population-scale sequencing resolves determinants of persistent EBV DNA

Nyeo, S. S. et al. Nature (2026). https://doi.org/10.1038/s41586-025-10020-2

The paper in one sentence

By reanalyzing whole-genome sequencing data from nearly 750,000 individuals, researchers developed a scalable pipeline to quantify circulating EBV DNA, revealing it as a genetically influenced biomarker linked to autoimmune, respiratory, and neurological diseases, and tied to HLA-mediated antigen presentation.

Summary

This study repurposed whole-genome sequencing (WGS) data from the UK Biobank and All of Us cohorts to detect and quantify Epstein–Barr virus (EBV) DNA from blood-derived sequencing libraries. After masking highly repetitive regions in the viral reference contig, the authors established EBV DNAemia, a binary trait indicating high circulating EBV DNA levels (≥1.2 viral copies per 10⁴ human cells). They found that ~10% of individuals had EBV DNAemia, which was reproducibly associated with phenotypes like rheumatoid arthritis, COPD, and depressive episodes. Genome-wide association studies identified 22 loci linked to EBV DNA persistence, with strong signals in the HLA region. The team then developed a harmonic best rank (HBR) score, based on NetMHC-predicted EBV peptide binding, to show that stronger predicted MHC class II presentation correlates with lower EBV DNA persistence. The work establishes EBV DNAemia as a polygenic trait, implicates antigen presentation as a key control mechanism, and provides a framework for studying viral persistence using existing population-scale sequencing data.
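
For intuition on how few viral reads sit behind this threshold, here is a back-of-the-envelope sketch that converts read counts into EBV copies per 10⁴ cells. It is not the authors' pipeline: it assumes equal read lengths for viral and human reads, a diploid human genome, approximate genome lengths, and it ignores the masked regions. Only the 1.2 threshold comes from the summary above; every name and constant below is an assumption.

```python
EBV_GENOME_BP = 172_000             # approximate EBV genome length (assumption)
HUMAN_HAPLOID_BP = 3_100_000_000    # approximate human haploid genome length (assumption)
DNAEMIA_THRESHOLD = 1.2             # viral copies per 10^4 human cells, from the summary

def ebv_copies_per_10k_cells(ebv_reads, human_reads):
    """Estimate EBV genome copies per 10,000 human cells from read counts.

    Depth is expressed in reads per base pair; read length cancels out
    when it is the same for viral and human reads.
    """
    ebv_depth = ebv_reads / EBV_GENOME_BP
    # One diploid cell contributes two haploid genomes' worth of human reads.
    human_depth_per_cell = human_reads / (2 * HUMAN_HAPLOID_BP)
    return 1e4 * ebv_depth / human_depth_per_cell

def has_dnaemia(ebv_reads, human_reads):
    """Binary trait: at or above the threshold quoted in the summary."""
    return ebv_copies_per_10k_cells(ebv_reads, human_reads) >= DNAEMIA_THRESHOLD

# Roughly 30x coverage with 150 bp reads is about 620 million human reads.
for n_reads in (0, 2, 5):
    print(n_reads, "EBV reads ->", has_dnaemia(n_reads, 620_000_000))
```

Under these assumptions the threshold falls at only a handful of viral reads per genome, which helps explain why careful masking of repetitive, bias-prone regions matters so much for this trait.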

Personal highlights

Retrospective viral quantification from existing WGS at petabase scale: by extracting reads mapping to the EBV contig in hg38 and masking two highly repetitive, bias-prone regions, the authors turned discarded sequencing data into a robust, continuous measure of circulating EBV DNA, enabling analysis across ~750,000 individuals without new experiments.
EBV DNAemia as a genetically informed binary biomarker: through careful thresholding against serostatus, EBV DNAemia identifies the ~10% of seropositive individuals with the highest viral DNA loads, transforming a noisy continuous signal into a tractable trait for GWAS and PheWAS with strong genetic architecture.
HLA-centric genetic architecture revealed through peptide presentation scoring: the strongest genetic associations localized to MHC class I and II genes. Using NetMHCpan/IIpan, the team computed per-person, per-allele harmonic best rank (HBR) scores and showed that stronger predicted EBV peptide presentation, especially via class II alleles, correlates with protection from EBV DNAemia.
Population-scale viral genomics to distinguish functional variants: by aggregating viral reads across hundreds of thousands of individuals, the authors assessed the prevalence of EBV variants previously linked to nasopharyngeal carcinoma, finding that most are common in healthy populations, highlighting the value of large-scale control data for interpreting viral pathogenicity.
Integration across cohorts, assays, and modalities for validation: signals replicated across UK Biobank and All of Us, were consistent with single-cell RNA-seq (minimal lytic transcription), and connected to enriched pathways (antigen processing/presentation) and cell types (B cells, dendritic cells) via independent multi-omic datasets.
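
The HBR score in the highlights can be sketched as follows, taking the name literally: for each HLA allele, keep the best (lowest) predicted percentile rank across EBV peptides, then take the harmonic mean across a person's alleles. The aggregation order and inputs are my assumption, not the authors' exact procedure; the rank values are invented, and `harmonic_best_rank` is a hypothetical helper.

```python
from statistics import harmonic_mean

def harmonic_best_rank(ranks_by_allele):
    """Harmonic mean, across alleles, of each allele's best (lowest) peptide rank.

    ranks_by_allele maps an allele name to predicted percentile ranks for a
    set of peptides (lower rank = stronger predicted binding).
    """
    best_ranks = [min(ranks) for ranks in ranks_by_allele.values()]
    return harmonic_mean(best_ranks)

# Invented percentile ranks for three peptides per class II allele.
person = {
    "DRB1*15:01": [12.0, 0.5, 30.0],
    "DRB1*03:01": [2.0, 45.0, 8.0],
    "DQB1*06:02": [60.0, 8.0, 25.0],
}
print(f"HBR: {harmonic_best_rank(person):.3f}")
```

The harmonic mean is dominated by the smallest values, so a single allele that presents a peptide very well pulls the score down sharply. That matches the intuition that one strong presenter may be enough for immune recognition.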

Why should we care?

The study demonstrates that existing population WGS datasets are a rich, underused resource for studying viral persistence, a framework that can be extended to other viruses in the human virome, moving beyond serology to direct DNA-based measures. By linking HLA variation to EBV DNA levels via computational peptide presentation scores, the work provides a mechanistic bridge between GWAS hits and immune function—offering a template for how to interpret HLA associations in infectious and

Amplified genome editing by in vivo editor production (NANITE)

Ngo, W. et al. bioRxiv (2026). https://doi.org/10.64898/2026.01.13.699115

The paper in one sentence

Researchers developed a “NANITE” system where cells transfected with a single plasmid become temporary factories, producing lipid vesicles that package and spread CRISPR-Cas9 editing machinery to neighboring cells, significantly amplifying gene editing effects both in lab cultures and in living mice.

Summary

This study introduces NANITE (NANoparticle-Induced Transfer of Enzyme), a novel strategy to overcome a major hurdle in gene therapy: getting the editing machinery into enough cells to have a therapeutic effect. Instead of trying to deliver editors to every target cell, NANITE delivers a single plasmid to a subset of cells. These cells then temporarily produce and secrete engineered lipid vesicles (enveloped delivery vehicles, or EDVs) that carry pre-assembled Cas9 ribonucleoproteins (RNPs). These vesicles travel to and edit neighboring cells, effectively spreading the editing effect. The team showed this amplified editing by ~3-fold in cultured cells and in the livers of mice, where it successfully reduced disease-related protein levels. The system’s targeting can be tuned with antibodies, and it operates transiently without detectable toxicity, offering a promising non-viral approach to boost the efficiency and lower the dose requirements of genome editing therapies.

Personal highlights

In vivo amplification via cellular relay: NANITE repurposes initially transfected cells as in vivo “factories” to produce and secrete editing vesicles, creating a local relay system that spreads CRISPR-Cas9 activity to surrounding, untransfected cells and triples overall editing efficiency.
Single-plasmid simplicity for complex vesicle production: the entire multi-component system for assembling targeted, genome-editing vesicles is condensed into a single deliverable plasmid, dramatically simplifying production and overcoming the in vivo coordination challenges of multi-plasmid approaches.
Modular targeting of editor delivery: the tropism of the therapeutic vesicles can be reprogrammed by incorporating single-chain antibodies, demonstrating that producer cells can be directed to selectively deliver their cargo to specific neighboring cell types based on surface receptor recognition.
Clinically relevant efficacy from low initial transfection: despite a low percentage of liver cells being initially transfected (~5-8%), NANITE achieved a ~50% reduction in serum transthyretin, a reduction associated with disease improvement in amyloidosis, showcasing how amplifying effects can make suboptimal delivery therapeutically sufficient.
Transient, non-replicative, and non-infectious mechanism: the system is designed for transient expression, shows no evidence of off-target organ editing or genomic integration, and uses non-replicative vesicles, addressing key safety concerns for in vivo gene therapy applications.

Why should we care?

NANITE represents a clever shift in therapeutic delivery. Instead of the immense (and often unsuccessful) struggle to deliver a drug to every single diseased cell, this approach allows us to deliver the instructions to a few cells and let them do the work of distributing the therapy locally. It effectively lowers the bar for how many cells we need to reach directly for a treatment to work.

HARMONIC: A Histology-Aware Graph Framework for Cell–Cell Communication Inference in Spatial Transcriptomics

Wang et al. bioRxiv (2026). doi:10.64898/2026.01.22.701166

The paper in one sentence

HARMONIC is a multimodal deep learning framework that integrates spatial transcriptomics (ST) and H&E-stained histology images to infer single-cell-resolved cell–cell communication by causally modeling tissue context, thereby reducing false positives/negatives.

Summary

HARMONIC addresses a key limitation in spatial transcriptomics-based cell–cell communication (CCC) inference: the lack of tissue microenvironmental context. By pairing ST data with H&E images, HARMONIC introduces a graphical causal structure learning module that models how histological features (e.g., physical barriers, cellular morphology) influence transcriptional states and communication likelihood. The method constructs candidate communication graphs from ST-based ligand–receptor co-expression and spatial proximity, then re-weights edges using a confounder-aware directed acyclic graph (DAG) that conditions on H&E-derived microenvironment features. HARMONIC was validated across multiple ST platforms, species, and tissue types, including mouse brain, kidney, and human ovarian cancer, and showed improved precision in detecting biologically plausible CCCs, especially in morphologically complex regions like tumor–immune interfaces.

Personal highlights

Causal modeling of tissue context in CCC graphs: HARMONIC formalizes the relationship between cellular transcriptomes, histology-derived microenvironment features, and cellular morphology using a directed acyclic graph (DAG), treating tissue context as a confounder to filter out spurious ligand–receptor co-expression signals.
Confounder-aware edge reweighting via conditional mutual information: the method computes three conditional mutual information (CMI) terms to deconfound molecular coupling from shared microenvironmental context, enabling edge-specific adjustment based on histological plausibility.
DAG-regularized structure learning for global coherence: beyond local edge scoring, HARMONIC enforces global directed acyclicity and structural consistency across the communication graph, ensuring that inferred CCC patterns align with tissue-level organization.
Context-aware evaluation metrics (CDES/CSCS): the paper introduces two novel metrics, Contextual Distance Enrichment Score (CDES) and Contextual Spatial Correlation Score (CSCS), that incorporate H&E-derived tissue features into CCC assessment, moving beyond distance- or expression-only metrics.
Demonstrated precision in morphologically complex regions: in tissues with clear histological boundaries (e.g., cortical layers, tumor–stroma interfaces), HARMONIC significantly reduces false positives/negatives compared to ST-only tools, highlighting the value of histology integration in noisy or heterogeneous microenvironments.
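
To illustrate why conditioning on tissue context can remove spurious ligand–receptor coupling, here is a minimal plug-in estimator of conditional mutual information, I(X; Y | Z), for discrete variables. HARMONIC's actual CMI terms operate on learned features and are not reproduced here; the binarized toy data and the function name are hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete samples.

    I(X; Y | Z) = sum over (x, y, z) of
        p(x, y, z) * log( p(x, y, z) * p(z) / (p(x, z) * p(y, z)) )
    """
    n = len(xs)
    xyz = Counter(zip(xs, ys, zs))
    xz = Counter(zip(xs, zs))
    yz = Counter(zip(ys, zs))
    z = Counter(zs)
    # The sample size cancels inside the log, so raw counts can be used there.
    return sum(
        (count / n) * math.log(count * z[k] / (xz[i, k] * yz[j, k]))
        for (i, j, k), count in xyz.items()
    )

# Ligand (x) and receptor (y) are "on" only in region 1: context drives both.
ligand   = [0, 0, 1, 1]
receptor = [0, 0, 1, 1]
region   = [0, 0, 1, 1]
no_context = [0, 0, 0, 0]   # conditioning on a constant gives plain mutual information

print("ignoring context:  ", round(conditional_mutual_information(ligand, receptor, no_context), 3))
print("given the region:  ", round(conditional_mutual_information(ligand, receptor, region), 3))
```

When ligand and receptor are both simply markers of the same region, the apparent coupling vanishes once the region is conditioned on. Coupling that persists within a region would keep a positive score, which is the intuition behind treating tissue context as a confounder.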

Critical note

While HARMONIC’s use of H&E features is methodologically innovative, its benchmarking approach leans heavily on simulated histological barriers and predefined anatomical regions (e.g., cortical layers, kidney zones). This raises questions about generalizability to tissues with less clear-cut morphology or poorer H&E–ST alignment. The proposed metrics (CDES/CSCS) incorporate H&E-derived “cellular complexity” and microenvironment features, but these are still proxy measures; the actual gold standard for CCC (e.g., direct protein interaction imaging) remains absent. Thus, while HARMONIC convincingly shows that H&E context reduces false calls in controlled settings, its real-world performance will depend on the quality and interpretability of histology feature extraction.

Other papers that piqued my interest and were added to the purgatory of my “to read” pile

Thanks for reading.

Cheers,

Seb.

Sebcentrism
