Weekly reads 30/6/25
Precision and perspective: from batch effects to basepair influence in single-cell biology
Week 15, and this one’s packed with big ideas in precision. From pinpointing batch effects at the gene level to decoding crosstalk pathways and visualizing DNA base influence, the common thread across this week's papers is methodological clarity—tools designed not just to perform better, but to explain why and how they work. If last week was about scaling and context, this one was about dissecting complexity—breaking down signals, confounders, and models into interpretable parts. Fewer datasets, more insight.
Preprints/articles that I managed to read this week
Quantifying Batch Effects at the Gene Level in Single-Cell Data with GTE
Jin, S., Zhou, Y., & Sheng, Q. (2024). Quantifying batch effects for individual genes in single-cell data. Research Square. DOI: 10.21203/rs.3.rs-4867545/v1.
The paper in one sentence
The authors introduce group technical effects (GTE), a quantitative metric to measure batch effects for individual genes in single-cell data, revealing that a small subset of highly technical genes (HTGs) drives batch effects and that removing them significantly improves data integration.
Summary
Batch effects in single-cell data are typically addressed by aligning cells across batches, but this study shifts focus to gene-level batch effects. The proposed GTE metric quantifies how much each gene contributes to batch variation, identifying HTGs that dominate technical noise. By removing HTGs—often as few as three genes—the method reduces batch effects while preserving biological signal. GTE is versatile, applicable across single-cell RNA-seq, ATAC-seq, and proteomics, and offers a principled approach to feature selection for integration.
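To make the idea of a gene-level batch score concrete, here is a minimal sketch. It is not the GTE formula from the paper: it assumes a simpler stand-in, the share of each gene's variance that lies between batches, computed inside each group (e.g. cell type) and averaged over groups. The function name per_gene_batch_score and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_gene_batch_score(expr, batches, groups):
    """Hypothetical per-gene batch score (NOT the paper's GTE definition).

    expr    : list of cells, each a list of gene expression values
    batches : batch label per cell
    groups  : group label per cell (e.g. cell type)
    Returns one score per gene: the between-batch share of variance,
    computed inside each group and averaged over groups.
    """
    cells_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        cells_by_group[group].append(idx)

    scores = []
    for gene in range(len(expr[0])):
        shares = []
        for idxs in cells_by_group.values():
            values = [expr[i][gene] for i in idxs]
            grand = mean(values)
            total = sum((v - grand) ** 2 for v in values)
            if total == 0:
                continue  # gene is constant in this group
            by_batch = defaultdict(list)
            for i in idxs:
                by_batch[batches[i]].append(expr[i][gene])
            between = sum(len(v) * (mean(v) - grand) ** 2
                          for v in by_batch.values())
            shares.append(between / total)
        scores.append(mean(shares) if shares else 0.0)
    return scores


# Toy data: gene 0 shifts with batch, gene 1 varies only within batches.
expr = [
    [1.0, 2.0], [1.2, 4.0],   # group T, batch a
    [5.0, 2.0], [5.2, 4.0],   # group T, batch b
    [1.1, 7.0], [0.9, 9.0],   # group B, batch a
    [5.1, 7.0], [4.9, 9.0],   # group B, batch b
]
batches = ["a", "a", "b", "b", "a", "a", "b", "b"]
groups = ["T", "T", "T", "T", "B", "B", "B", "B"]

scores = per_gene_batch_score(expr, batches, groups)
# Rank genes from most to least batch-driven; the top ones are removal candidates.
ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
print(scores, ranked)
```

Ranking genes by a score like this and dropping the top few before integration mirrors the workflow described above, under the assumptions stated.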
Personal highlights
Gene-level batch effect quantification: GTE pinpoints individual genes responsible for batch effects, moving beyond cell-centric correction methods to reveal uneven technical noise across the feature space.
Minimal HTGs, maximal impact: Demonstrates that removing just three high-GTE genes can substantially reduce batch effects, highlighting the outsized role of specific technical artifacts.
Beyond HVGs: Shows that selecting highly variable genes (HVGs) alone doesn’t deal with batch effects: HTGs often overlap with high-variance genes (e.g., mitochondrial/ribosomal genes), so they survive HVG selection and require targeted removal.
Flexible group variable integration: Works with cell types, sequencing technologies, or even inferred "source" labels, making it adaptable to datasets with or without annotations.
Cross-modal applicability: Successfully applied to scRNA-seq, bulk RNA-seq, scATAC-seq, and proteomics, proving GTE’s utility across diverse omics data types.
Why should we care?
Batch effects are a pervasive headache in single-cell analysis, often muddling biological interpretation. GTE offers a transparent, gene-centric solution: instead of black-box cell alignment, it identifies and removes the handful of genes most responsible for technical noise. For bioinformaticians, this provides a scalable, interpretable tool for cleaner integrations; for experimentalists, it clarifies which genes are trustworthy—or suspect—in cross-batch comparisons. By tackling batch effects at their root (specific genes), GTE bridges the gap between technical artifacts and biological truth, enabling more reliable discoveries in multi-batch studies.
Beyond Visual Inspection: A Quantitative Framework for Evaluating Single-Cell Trajectory Representations
Inecik, K., Rose, A., Haniffa, M., Luecken, M. D., & Theis, F. J. (2025). Beyond Visual Inspection: Principled Benchmarking of Single-Cell Trajectory Representations with scTRAM. bioRxiv. doi: 10.1101/2025.06.23.661141
The paper in one sentence
scTRAM introduces a principled, multi-metric framework to quantitatively evaluate how well low-dimensional single-cell embeddings preserve ground-truth biological trajectories, addressing critical gaps in current qualitative and ad-hoc benchmarking practices.
Summary
The paper presents scTRAM (single-cell TRAjectory representation Metrics), a systematic approach to assess the fidelity of single-cell embeddings in preserving trajectory structures—such as cell differentiation or disease progression—across three complementary axes: topological consistency, manifold continuity, and pseudotime alignment. Unlike traditional methods that rely on visual inspection or scalar surrogates, scTRAM decomposes trajectory integrity into localized and global failure modes, enabling granular comparisons of embedding methods. The framework is validated across diverse datasets and models, revealing trade-offs in trajectory preservation (e.g., topology vs. pseudotime accuracy) and demonstrating its utility for model selection, hyperparameter tuning, and downstream biological interpretation.
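Two of these axes can be illustrated with generic metrics. The sketch below is not scTRAM's implementation: it assumes Spearman rank correlation as a proxy for pseudotime alignment and k-nearest-neighbour overlap as a proxy for manifold continuity. All function names and the toy data are hypothetical.

```python
import math


def rank(values):
    """Ranks starting at 0 (ties broken by original order; fine for a toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def spearman(x, y):
    """Spearman rank correlation, without tie correction."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def knn(points, k):
    """Set of the k nearest neighbours of every point (Euclidean distance)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result


def knn_overlap(reference, embedding, k):
    """Mean fraction of reference-space neighbours that are kept in the embedding."""
    ref, emb = knn(reference, k), knn(embedding, k)
    return sum(len(a & b) / k for a, b in zip(ref, emb)) / len(ref)


# Toy trajectory: six cells ordered along a line in the reference space.
true_time = [0, 1, 2, 3, 4, 5]
reference = [(t, 0.0) for t in true_time]
good_embedding = [(2.0 * t, 1.0) for t in true_time]            # stretched, order kept
scrambled_embedding = [(p, 0.0) for p in [0, 3, 1, 5, 2, 4]]    # order broken

for name, emb in [("good", good_embedding), ("scrambled", scrambled_embedding)]:
    emb_time = [p[0] for p in emb]  # first coordinate used as embedding pseudotime
    print(name, spearman(true_time, emb_time), knn_overlap(reference, emb, k=2))
```

Reporting the two numbers side by side, rather than collapsing them into one score, is the point: an embedding can keep local neighbourhoods while still misordering pseudotime, or the reverse.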
Personal highlights
Multi-scale trajectory fidelity metrics: scTRAM evaluates embeddings across complementary failure modes—from local neighborhood scrambling to global branch misordering—using a suite of 48 metrics grouped into topological, geometric, and temporal categories, enabling nuanced performance analysis.
Edge-specific performance decomposition: the framework partitions trajectories into segments and edges, revealing localized biases in embedding methods (e.g., scANVI excels in myeloid lineage transitions, while TarDis preserves lymphoid differentiation), which aggregate metrics often obscure.
Beyond scalar scores: by replacing single-number benchmarks with a multi-axis score vector, scTRAM exposes trade-offs inherent to representation learning (e.g., adjacency accuracy vs. pseudotime monotonicity), guiding context-aware model selection.
Integration with training pipelines: the metrics can serve as optimization objectives during model training, enabling embeddings explicitly tailored to preserve biologically critical trajectory features rather than generic integration goals.
Robust validation: controlled experiments (e.g., cell-type ablation, feature dropout) confirm scTRAM’s sensitivity to genuine structural perturbations, distinguishing artifacts from biologically meaningful distortions.
Why should we care?
scTRAM shifts single-cell trajectory analysis from subjective, visualization-heavy evaluation to rigorous, quantitative benchmarking—a critical advance as trajectory-aware embeddings become central to studying development, disease, and immune responses. For computational biologists, it provides a standardized toolkit to compare methods, diagnose model failures, and optimize embeddings for specific biological questions. For practitioners, it offers actionable insights: for example, choosing TarDis for B-cell studies or scANVI for myeloid trajectories. By bridging differential geometry (manifold preservation) and biology (lineage fidelity), scTRAM ensures that computational abstractions faithfully reflect the continuous, branching nature of cellular processes, ultimately enhancing the reliability of downstream discoveries.
CellAgentChat: An Agent-Based Model for Decoding Cell-Cell Interactions from Single-Cell and Spatial Transcriptomics
Raghavan et al. (2025). Genome Research, 35:1646–1663. DOI: 10.1101/gr.279771.124
The paper in one sentence
CellAgentChat introduces an agent-based modeling (ABM) framework to infer and visualize cell-cell interactions (CCIs) from single-cell and spatial transcriptomics data, offering dynamic simulations, in silico perturbations, and resolution of both short- and long-range signaling.
Summary
CellAgentChat addresses limitations in existing CCI inference methods by modeling cells as autonomous agents governed by biologically inspired rules. Unlike population-level approaches, it captures single-cell dynamics, integrates spatial context, and enables in silico perturbations (e.g., receptor blocking) to predict therapeutic targets. Validated across diverse datasets, it outperforms benchmarks in accuracy and offers unique functionalities like animated visualizations and tunable interaction range detection.
Personal highlights
Agent-based modeling for single-cell resolution: CellAgentChat treats each cell as an autonomous agent with gene expression, spatial coordinates, and interaction rules, enabling fine-grained analysis of CCIs beyond cluster-averaged methods.
Spatially informed interaction scoring: Incorporates ligand diffusion rates weighted by distance (adjustable decay parameter δ), allowing explicit modeling of both short- and long-range interactions—validated against known ligand-receptor distance classes.
In silico receptor blocking for therapeutic discovery: Simulates receptor inhibition via a biologically constrained neural network, predicting downstream gene expression changes and identifying clinically relevant targets (e.g., EGFR, PD-1 in breast cancer).
Dynamic visualization of cellular crosstalk: Animated ABM platform visualizes real-time cell-cell communication, highlighting heterogeneity in interaction strengths (e.g., invasive vs. non-invasive tumor cells in PDAC).
Modular and interpretable framework: Combines statistical (permutation tests) and deep learning (regulator conversion rates) components within a unified ABM architecture, balancing flexibility with biological plausibility.
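The distance-weighted scoring with a tunable decay parameter δ can be sketched as follows. The exponential decay and the product of ligand and receptor expression are my assumptions for illustration; the paper's exact functional form may differ, and interaction_score is a hypothetical name.

```python
import math


def interaction_score(ligand, receptor, sender_xy, receiver_xy, delta):
    """Toy ligand-receptor score between two cells.

    ligand, receptor : expression of the ligand in the sender and of the
                       receptor in the receiver
    delta            : decay parameter; larger values favour short-range signalling
    The exponential decay is an assumed form, not taken from the paper.
    """
    distance = math.dist(sender_xy, receiver_xy)
    return ligand * receptor * math.exp(-delta * distance)


sender = (0.0, 0.0)
near_cell = (1.0, 0.0)
far_cell = (10.0, 0.0)

# A small delta lets the signal reach distant cells (long-range mode).
long_range = [interaction_score(2.0, 3.0, sender, c, delta=0.1) for c in (near_cell, far_cell)]
# A large delta suppresses everything but close neighbours (short-range mode).
short_range = [interaction_score(2.0, 3.0, sender, c, delta=1.0) for c in (near_cell, far_cell)]
print(long_range, short_range)
```

Sweeping delta and watching which ligand-receptor pairs keep their score at a distance is one simple way to think about separating short-range from long-range interactions.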
Why should we care?
CellAgentChat shifts the paradigm in CCI analysis from static, population-level summaries to dynamic, mechanistic models. For computational biologists, it offers a versatile ABM framework to test hypotheses about cellular signaling; for translational researchers, it bridges spatial omics and drug discovery by prioritizing targetable receptors. By simulating perturbations and spatial constraints, it opens new avenues to study tissue organization, disease mechanisms, and therapeutic interventions—all while maintaining interpretability and scalability.
SigXTalk: Decoding Cell-Cell Communication Crosstalk with Single-Cell Transcriptomics
Hou, J., Zhao, W., & Nie, Q. (2025). Dissecting crosstalk induced by cell-cell communication using single-cell transcriptomic data. Nature Communications, 16, 5970. https://doi.org/10.1038/s41467-025-61149-7
The paper in one sentence
SigXTalk is a machine learning-based method that quantifies crosstalk between cell-cell communication pathways using single-cell RNA-seq data, introducing the concepts of fidelity and specificity to measure regulatory selectivity.
Summary
SigXTalk addresses the overlooked complexity of crosstalk between pathways activated by cell-cell communication (CCC). By integrating hypergraph neural networks and tree-based machine learning, it systematically identifies shared signaling components (SSCs) and quantifies how CCC signals propagate through intertwined pathways to regulate target genes. The method evaluates fidelity (a pathway’s resistance to off-target signal interference) and specificity (its precision in targeting genes), offering a granular view of intracellular regulatory networks. Benchmarked against 12 existing methods, SigXTalk outperforms in recovering crosstalk pathways and is robust across datasets, enabling applications from disease analysis to temporal tracking of signaling dynamics.
Personal highlights
Hypergraph learning for higher-order regulatory relationships: SigXTalk encodes complex crosstalk by modeling receptors, transcription factors (TFs), and targets as nodes in a hypergraph, capturing multi-way interactions beyond pairwise gene-gene links—critical for dissecting shared pathway components.
Quantifying pathway selectivity via fidelity and specificity: two novel metrics are introduced. Fidelity measures a pathway’s dominance in regulating its target despite competing signals, while specificity assesses its avoidance of off-target gene activation. Both are key for understanding signal leakage and regulatory precision.
Self-supervised training with limited prior knowledge: the framework trains on sparse regulatory data by leveraging highly correlated gene pairs, enabling robust pathway prediction even when ground-truth interactions are incomplete.
Benchmarked against 12 GRN methods: outperforms existing tools (18% higher AUROC) by explicitly modeling crosstalk, where traditional methods fail due to their focus on linear, pairwise interactions.
Flexible applications across biological contexts: designed for single-cell data, SigXTalk adapts to diverse use cases—from contrasting diseased vs. healthy tissues to tracking temporal signaling shifts—without requiring predefined pathways.
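One plausible reading of fidelity and specificity is as two normalisations of the same matrix of pathway strengths. The sketch below assumes a hypothetical matrix strength[s][t] of non-negative regulatory strengths from signal s to target t; how SigXTalk estimates those strengths, and its exact definitions, are not reproduced here.

```python
def fidelity_and_specificity(strength):
    """Normalise a signal-by-target strength matrix in two directions.

    fidelity[s][t]    : share of target t's total incoming regulation that comes
                        from signal s (dominance over competing signals)
    specificity[s][t] : share of signal s's total outgoing regulation that goes
                        to target t (little spill-over onto other targets)
    Both definitions are an illustrative assumption.
    """
    n_signals, n_targets = len(strength), len(strength[0])
    col_sums = [sum(strength[s][t] for s in range(n_signals)) for t in range(n_targets)]
    row_sums = [sum(row) for row in strength]
    fidelity = [[strength[s][t] / col_sums[t] if col_sums[t] else 0.0
                 for t in range(n_targets)] for s in range(n_signals)]
    specificity = [[strength[s][t] / row_sums[s] if row_sums[s] else 0.0
                    for t in range(n_targets)] for s in range(n_signals)]
    return fidelity, specificity


# Toy example: signals A and B, targets g1 and g2.
strength = [
    [8.0, 2.0],  # signal A regulates g1 strongly and g2 weakly
    [2.0, 0.0],  # signal B regulates only g1, and weakly
]
fidelity, specificity = fidelity_and_specificity(strength)
print(fidelity)
print(specificity)
```

In this toy case signal B is perfectly specific for g1 but has low fidelity, because signal A dominates the regulation of g1. That is the kind of distinction the two metrics are meant to expose.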
Why should we care?
SigXTalk shifts the focus from whether cells communicate to how signals propagate through tangled intracellular networks. For computational biologists, it offers a scalable, interpretable framework to dissect regulatory crosstalk—a pervasive but understudied phenomenon. For experimentalists, it generates testable hypotheses about key regulators (e.g., TFs like FOS in cancer) and their context-dependent roles. By linking CCC to downstream targets via quantifiable pathways, SigXTalk bridges the gap between ligand-receptor mapping and functional outcomes, with implications for understanding drug resistance, developmental patterning, and cell-type-specific signaling.
TarDis: Achieving Robust and Structured Disentanglement of Multiple Covariates in Single-Cell Genomics
Inecik et al. (2024). bioRxiv. doi: 10.1101/2024.08.20.509903
The paper in one sentence
TarDis is a deep generative model that systematically disentangles categorical and continuous covariates into independent latent dimensions, enabling interpretable and robust analysis of single-cell genomics data while preserving biological signals.
Summary
TarDis addresses the challenge of covariate disentanglement in single-cell genomics by introducing a tailored variational autoencoder (VAE) framework. The model employs covariate-specific loss functions to isolate technical (e.g., batch effects) and biological (e.g., cell type, developmental stage) factors into distinct latent subspaces. Key innovations include:
Targeted disentanglement: Explicitly separates covariates (e.g., age, drug dosage) into reserved latent dimensions while leaving residual variation in an unreserved subspace.
Handling continuous covariates: Uses distance-weighted losses to maintain ordered representations (e.g., gradients in pseudotime or dosage responses).
Scalability: Outperforms existing methods in data integration, out-of-distribution generalization, and interpretability across diverse datasets.
By decoupling confounding factors, TarDis enables clearer biological insights and hypothesis generation from complex single-cell data.
Personal highlights
Structured disentanglement of mixed covariates: TarDis uniquely handles both categorical (e.g., cell type) and continuous (e.g., pseudotime) covariates simultaneously, isolating each into dedicated latent dimensions while avoiding information leakage.
Distance-based losses for continuous variables: unlike methods that discretize continuous covariates, TarDis preserves their intrinsic ordering—critical for modeling gradients like dose-response curves or developmental trajectories.
Hypothesis-driven covariate selection: the model encourages users to prioritize covariates aligned with research questions, avoiding overcorrection and ensuring biologically meaningful latent representations.
Robust out-of-distribution predictions: by cleanly separating covariate effects, TarDis generalizes to unseen conditions (e.g., predicting cellular responses to novel drug dosages), outperforming benchmarks like CPA.
Interpretable latent spaces: the reserved subspaces align with known biological or technical factors (e.g., batch effects), while the unreserved space captures nuanced variation, facilitating downstream analysis.
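The idea behind a distance-based loss for a continuous covariate can be sketched in a few lines. This is an assumed, simplified form (mean squared gap between pairwise latent distances and pairwise covariate distances), not the loss used in TarDis; distance_alignment_loss is a hypothetical name.

```python
import math
from itertools import combinations


def distance_alignment_loss(latents, covariate):
    """Mean squared gap between latent distance and covariate distance over cell pairs.

    latents   : one latent vector per cell (only the dimensions reserved
                for this covariate)
    covariate : one continuous value per cell (e.g. dose or age)
    The loss is zero when distances in the reserved subspace match distances
    in the covariate, which keeps the covariate's ordering intact.
    """
    pairs = list(combinations(range(len(covariate)), 2))
    total = 0.0
    for i, j in pairs:
        d_latent = math.dist(latents[i], latents[j])
        d_covariate = abs(covariate[i] - covariate[j])
        total += (d_latent - d_covariate) ** 2
    return total / len(pairs)


dose = [0.0, 1.0, 2.0, 4.0]
ordered_latents = [[0.0], [1.0], [2.0], [4.0]]    # gradient preserved
shuffled_latents = [[2.0], [0.0], [4.0], [1.0]]   # gradient lost

print(distance_alignment_loss(ordered_latents, dose))
print(distance_alignment_loss(shuffled_latents, dose))
```

Discretising the dose into bins would treat 1.0 and 4.0 as equally different from 0.0; a distance-based term like this one keeps the difference between them.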
Why should we care?
TarDis bridges a gap in single-cell analysis by not just correcting for confounders but structuring them into interpretable components. For computational biologists, it offers a flexible framework to dissect complex datasets without losing signal to overcorrection. For wet-lab researchers, it generates testable hypotheses—like how specific genetic variants modulate drug responses across cell types. By turning "noise" into actionable axes of variation, TarDis empowers precision in modeling cellular heterogeneity, with implications for disease research, therapeutic development, and beyond.
Decoding Genomic Rules with PISA: A Versatile Tool for Visualizing cis-Regulatory Logic
McAnany, C. E., Weilert, M., Mehta, G., Kamulegeya, F., Gardner, J. M., Kundaje, A., & Zeitlinger, J. (2025). PISA: a versatile interpretation tool for visualizing cis-regulatory rules in genomic data. bioRxiv. doi: https://doi.org/10.1101/2025.04.07.647613
The paper in one sentence
PISA (pairwise influence by sequence attribution) is a novel deep learning interpretation tool that visualizes how individual DNA bases influence genomic readouts at single-base resolution, enabling precise dissection of cis-regulatory rules and experimental biases in sequence-to-function models.
Summary
The paper introduces PISA, a computational method designed to interpret sequence-to-function neural networks by quantifying and visualizing the pairwise influence of each input DNA base on every output position in genomic data. Integrated into the BPReveal framework, PISA generates two intuitive plot types—squid plots and heatmaps—to reveal the spatial range and directionality of motif effects, distinguish biological signals from experimental biases, and enable bias-corrected modeling of complex assays like MNase-seq. By leveraging Shapley values for base-resolution attribution, PISA uncovers previously hidden regulatory patterns, such as motifs with mixed positive/negative contributions, and facilitates synthetic sequence design with tailored nucleosome positioning.
Personal highlights
Base-resolution attribution for cis-regulatory logic: PISA assigns Shapley values to each input-output base pair, creating a 2D matrix (ℙᵢ→ⱼ) that quantifies how DNA sequence influences predictions at individual genomic coordinates—revealing motifs with spatially opposing effects that cancel out in traditional methods.
Bias-aware modeling with ChromBPNet integration: PISA heatmaps explicitly distinguish enzymatic biases (diagonal patterns) from biological signals (vertical bands), enabling the derivation of synthetic bias tracks for assays like MNase-seq where control datasets are unavailable.
Scalable framework for diverse genomics assays: Implemented in BPReveal, PISA generalizes across ChIP-seq, ATAC-seq, and nucleosome mapping data, offering a unified platform to dissect sequence rules while controlling for technical artifacts.
From interpretation to design: PISA’s genetic algorithm leverages model insights to engineer sequences with altered nucleosome configurations, bridging the gap between computational prediction and experimental validation.
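The shape of a pairwise input-to-output influence matrix is easy to show on a toy model. Note the substitution: instead of Shapley values, the sketch below uses plain occlusion (replace one base with a reference base and record the change at every output position). The toy model, the kernel, and all names are hypothetical and unrelated to BPReveal's API.

```python
# Toy "sequence-to-profile" model: each G or C raises the output at its own
# position and, more weakly, at the two neighbouring positions.
KERNEL = {-1: 0.5, 0: 1.0, 1: 0.5}
BASE_SCORE = {"A": 0.0, "C": 1.0, "G": 1.0, "T": 0.0}


def toy_model(seq):
    """Return one output value per position of the input sequence."""
    n = len(seq)
    return [sum(BASE_SCORE[seq[i]] * KERNEL.get(j - i, 0.0) for i in range(n))
            for j in range(n)]


def pairwise_influence(seq, model, reference="A"):
    """Occlusion-based influence matrix (a stand-in for Shapley attribution).

    matrix[i][j] is the change at output position j when input base i is
    replaced by the reference base.
    """
    baseline = model(seq)
    matrix = []
    for i in range(len(seq)):
        mutated = seq[:i] + reference + seq[i + 1:]
        perturbed = model(mutated)
        matrix.append([b - p for b, p in zip(baseline, perturbed)])
    return matrix


influence = pairwise_influence("AAGAA", toy_model)
for row in influence:
    print(row)
```

Only the row for the G at input position 2 is non-zero, and it spreads over output positions 1 to 3. Plotting such a matrix as a heatmap gives the kind of spatially explicit picture described above.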
Why should we care?
PISA transforms "black box" genomic deep learning models into interpretable engines for biological discovery. For computational biologists, it provides a rigorous method to extract base-level regulatory logic and correct biases; for experimentalists, it offers testable hypotheses about motif function and sequence design. By making model interpretations spatially explicit, PISA advances efforts to decode the cis-regulatory code—critical for understanding genetic diseases, designing synthetic biology constructs, and refining functional genomics assays.
Spatial-DMT: A Breakthrough Method for Simultaneous Spatial Profiling of DNA Methylome and Transcriptome
Lee, C. N., Fu, H., Cardilla, A., Zhou, W., & Deng, Y. (2025). Spatial joint profiling of DNA methylome and transcriptome in mammalian tissues. bioRxiv. doi: 10.1101/2025.07.01.662607
The paper in one sentence
Spatial-DMT is a novel technology enabling high-resolution, simultaneous spatial mapping of DNA methylation and gene expression in intact tissues, offering unprecedented insights into epigenetic regulation within native tissue contexts.
Summary
The study introduces Spatial-DMT, a pioneering method that combines microfluidic in situ barcoding, enzymatic methylation sequencing, and high-throughput sequencing to co-profile DNA methylation and transcriptome in the same tissue section at near single-cell resolution. This approach overcomes the limitations of existing spatial omics technologies, which lack the ability to directly measure DNA methylation spatially. Applied to mouse embryogenesis and postnatal brain tissues, Spatial-DMT generates rich, reproducible datasets that reveal intricate spatiotemporal relationships between epigenetic modifications and gene expression, providing a powerful tool for studying development, disease, and tissue biology.
Personal highlights
Dual-modality spatial profiling: Spatial-DMT simultaneously captures DNA methylation and transcriptome data from the same tissue section, enabling direct correlation of epigenetic states with gene expression in their native spatial context.
Enzymatic methylation sequencing: the method employs an enzyme-based alternative to bisulfite conversion (Enzymatic Methyl-seq), minimizing DNA damage while achieving high-quality methylation profiling comparable to single-cell studies.
Microfluidic in situ barcoding: utilizes a two-dimensional grid of spatially barcoded pixels (up to 2,500 unique combinations), allowing precise localization of epigenetic and transcriptional signals within tissues.
Near single-cell resolution: achieves spatial mapping at 10 μm pixel resolution, revealing fine-scale epigenetic and transcriptional heterogeneity in complex tissues like the developing brain.
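The two-dimensional barcoding scheme boils down to a lookup from a pair of barcodes to a grid position. The sketch below assumes 50 barcodes per axis (chosen only because 50 × 50 gives the 2,500 combinations mentioned above) and uses made-up barcode names; the real barcode sequences and read structure are not modelled.

```python
from itertools import product

N_PER_AXIS = 50  # assumption: 50 x 50 = 2,500 barcode combinations

# Hypothetical barcode names; real barcodes are DNA sequences.
barcodes_a = [f"A{i:02d}" for i in range(N_PER_AXIS)]  # one per row
barcodes_b = [f"B{j:02d}" for j in range(N_PER_AXIS)]  # one per column

# Every (barcode A, barcode B) pair identifies exactly one pixel of the grid.
pixel_of = {
    (a, b): (i, j)
    for (i, a), (j, b) in product(enumerate(barcodes_a), enumerate(barcodes_b))
}


def assign_read(barcode_a, barcode_b):
    """Return the (row, column) pixel for a read, or None if a barcode is unknown."""
    return pixel_of.get((barcode_a, barcode_b))


print(len(pixel_of))
print(assign_read("A03", "B17"))
print(assign_read("A99", "B00"))
```

Because methylation and RNA reads from the same pixel carry the same barcode pair, both modalities land on the same grid position, which is what allows the direct comparison within one tissue section.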
Why should we care?
Spatial-DMT represents a transformative leap in spatial omics by bridging the gap between epigenetic regulation and gene expression in intact tissues. For researchers, it offers a robust, reproducible platform to study how DNA methylation shapes cellular identity and function within their native microenvironment—critical for understanding development, aging, and disease. For clinicians, the method’s potential application to FFPE tissues could unlock new epigenetic biomarkers for precision medicine. By preserving spatial context, Spatial-DMT moves beyond bulk or single-cell assays, providing a systems-level view of gene regulation that is both mechanistic and actionable.
Other papers that piqued my interest and were added to the purgatory of my “to read” pile
Beyond benchmarking: an expert-guided consensus approach to spatially aware clustering
Tracing colorectal malignancy transformation from cell to tissue scale
Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights
Tracing the Shared Foundations of Gene Expression and Chromatin Structure
Rewriting regulatory DNA to dissect and reprogram gene expression
Facilitate integrated analysis of single cell multiomic data by binarizing gene expression values
Simultaneous epigenomic profiling and regulatory activity measurement using e2MPRA
Gene context drift identifies drug targets to mitigate cancer treatment resistance
Clonal evolution of hematopoietic stem cells after autologous stem cell transplantation
The mutagenic forces shaping the genomes of lung cancer in never smokers
Reactivation of mammalian regeneration by turning on an evolutionarily disabled genetic switch
Thanks for reading.
Cheers,
Seb.