Weekly reads 23/6/25

Bachelor weekend, five papers, and a future full of predictive biology

Jul 01, 2025

This week’s roundup is a short but dense one—just five papers, posted a couple of days later than usual (bachelor weekend well spent!). From predicting gene regulation with stunning accuracy to mapping ecDNA in over a thousand lung cancers, the week’s selections highlight how AI, scale, and molecular context are pushing single-cell and spatial biology forward. Whether it’s SCHAF inferring transcriptomics from H&E, or Corgi modeling regulatory logic from sequence + TFs, the line between traditional biology and in silico prediction continues to blur. A smaller set this time—but definitely not lighter on insights.

Preprints/articles that I managed to read this week

Examining the Role of Extrachromosomal DNA in 1,216 Lung Cancers

Khandekar et al. (2025). bioRxiv. doi: 10.1101/2025.06.03.657117

The paper in one sentence

This study reveals that extrachromosomal DNA (ecDNA) is present in 18.9% of lung cancers, drives oncogene amplification (e.g., MDM2), and is strongly linked to genomic instability, particularly whole-genome doubling, but shows no significant associations with smoking status, histology, or ancestry.

Summary

The study analyzes 1,216 lung cancer genomes to investigate the role of ecDNA, a circular form of DNA that amplifies oncogenes and contributes to tumor aggressiveness. Key findings include:

ecDNA is present in 17% of never-smokers (LCINS) and 23% of smokers (LCSS), with no significant differences across histologies, ancestries, or geographic regions.
MDM2 is the most frequently amplified oncogene on ecDNA, showing mutual exclusivity with TP53 alterations.
ecDNA is strongly associated with whole-genome doubling (WGD) and chromothripsis, suggesting it arises as a byproduct of genomic instability.
While ecDNA-positive tumors have worse survival, their prognosis is similar to tumors with other focal amplifications.

Personal highlights

ecDNA prevalence independent of smoking or ancestry: unlike other genomic alterations tied to tobacco exposure, ecDNA occurs at similar rates in never-smokers (17%) and smokers (23%), challenging assumptions about its environmental drivers.
Oncogene amplification via ecDNA: MDM2 is the top amplified oncogene on ecDNA, with mutual exclusivity to TP53 alterations—highlighting a key survival mechanism in a subset of lung cancers.
Genomic instability as the primary driver: ecDNA is strongly linked to whole-genome doubling (4x higher odds in LCINS, 5.85x in LCSS) and chromothripsis, positioning it as a consequence—not always a cause—of tumor chaos.
Weak but notable mutation associations: EGFR L858R mutations are enriched in ecDNA+ tumors, while KRAS G12V mutations are depleted, suggesting ecDNA may interact with specific driver pathways.
Clinical implications: ecDNA correlates with poor survival, but not more so than other focal amplifications—emphasizing that genomic instability, not just ecDNA itself, worsens outcomes.
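
For readers less used to odds ratios, here is a small sketch of what a statement like "4x higher odds" means. The counts below are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the paper, and the published estimates may come from an adjusted model, which this unadjusted 2x2 calculation ignores.

```python
def odds_ratio(group1_pos, group1_neg, group2_pos, group2_neg):
    """Unadjusted odds ratio from a 2x2 table (no continuity correction)."""
    return (group1_pos / group1_neg) / (group2_pos / group2_neg)

# Hypothetical counts, NOT from the paper.
# Group 1: tumors with whole-genome doubling (WGD); group 2: tumors without WGD.
wgd_ecdna_pos, wgd_ecdna_neg = 40, 60          # odds of ecDNA with WGD: 40/60
no_wgd_ecdna_pos, no_wgd_ecdna_neg = 20, 120   # odds of ecDNA without WGD: 20/120

result = odds_ratio(wgd_ecdna_pos, wgd_ecdna_neg,
                    no_wgd_ecdna_pos, no_wgd_ecdna_neg)
print(f"Odds ratio: {result:.2f}")
```

With these made-up numbers, the odds of carrying ecDNA are four times higher in the WGD group, even though the raw proportions (40% vs. about 14%) differ by less than a factor of four.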

Why should we care?

This work reshapes how we view ecDNA in lung cancer: not as a smoking-related aberration, but as a universal marker of genomic instability that can amplify key oncogenes like MDM2. For biologists, it underscores ecDNA’s role in tumor evolution; for clinicians, it suggests that targeting ecDNA-driven pathways (e.g., MDM2-p53) could benefit a subset of patients. The lack of ties to smoking or ancestry also implies ecDNA’s relevance across diverse populations, urging broader investigation into its therapeutic vulnerabilities.

SCHAF: A Deep Learning Framework for Inferring Single-Cell Omics from Histology Images

Comiter et al. (2023). Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF). bioRxiv. doi: 10.1101/2023.03.21.533680.

The paper in one sentence

SCHAF is a deep learning framework that predicts spatially resolved, single-cell transcriptomic profiles directly from standard histology (H&E) images, bridging the gap between routine pathology and high-resolution molecular data.

Summary

SCHAF (Single-Cell omics from Histology Analysis Framework) leverages vision transformers and adversarial deep learning to infer single-cell gene expression profiles from Hematoxylin and Eosin (H&E) stained tissue images. It comes in two variants:

Paired SCHAF: Uses spatial transcriptomics data during training to generate spatially accurate, transcriptome-wide predictions.
Unpaired SCHAF: Requires only scRNA-seq and H&E data, enabling cell-type distribution inference without spatial training data.

The framework demonstrates robust performance across diverse tissues (e.g., cancer, placenta) and offers quality scores to flag reliable predictions. By translating routine histology into rich molecular datasets, SCHAF unlocks new opportunities for research and clinical applications.

Personal highlights

Dual-mode architecture for flexible inference: SCHAF’s paired and unpaired variants accommodate varying data availability—spatial transcriptomics for precise mapping or scRNA-seq alone for broader applicability—making it adaptable to diverse research and clinical settings.
Spatial fidelity with predictive quality scores: paired SCHAF not only infers gene expression but also provides per-gene quality scores (PQS) to flag spatially reliable predictions, enhancing interpretability for downstream analysis.
Adversarial learning for latent space alignment: unpaired SCHAF uses adversarial training to align H&E image features with scRNA-seq profiles in a shared latent space, enabling cell-type distribution inference without spatial data.
Validation across modalities and species: the framework is rigorously validated against experimental data (e.g., Xenium, MERFISH, ISH) and outperforms existing methods in spatial correlation and cell-type accuracy, demonstrating cross-tissue robustness.
Clinically scalable molecular profiling: by generating single-cell omics from H&E—a ubiquitous, low-cost technique—SCHAF democratizes high-resolution molecular analysis, even for archived or resource-limited samples.

Why should we care?

SCHAF transforms routine histology into a gateway for single-cell biology, offering three key advances:

For researchers: It bypasses the cost and complexity of spatial transcriptomics, enabling hypothesis generation and validation using existing H&E archives. The predictive quality scores add a layer of reliability for interpreting results.
For clinicians: The ability to infer molecular profiles from standard pathology slides could enhance diagnostic precision, uncover hidden disease subtypes, and guide personalized therapies—all without additional assays.
For computational biologists: SCHAF’s modular design (paired/unpaired) and integration of foundation models (UNI, scGPT) set a template for cross-modal inference, inspiring applications beyond transcriptomics (e.g., proteomics, epigenetics).

By bridging histology and single-cell biology, SCHAF opens a new frontier: in silico spatial omics, where every H&E slide becomes a potential treasure trove of molecular insights.

Figure 1 from Comiter et al. - Overview of the SCHAF workflow

scBaseCount: An AI-Powered, Uniformly Processed Single-Cell Data Repository

Youngblut, N. D., Carpenter, C., Prashar, J., et al. (2025). scBaseCount: an AI agent-curated, uniformly processed, and continually expanding single-cell data repository. bioRxiv. doi: https://doi.org/10.1101/2025.02.27.640494

The paper in one sentence

scBaseCount is a groundbreaking, AI-driven repository that automates the discovery, annotation, and standardized processing of single-cell RNA-seq data, offering the largest and most harmonized resource for computational biology and AI model training.

Summary

scBaseCount addresses the challenges of integrating single-cell RNA-seq datasets by leveraging an AI agent (SRAgent) to automate metadata extraction and a standardized pipeline (scRecounter) for uniform data processing. This approach minimizes technical variability, enhances cross-study comparability, and supports diverse analytical needs, from gene expression quantification to RNA velocity. The repository currently spans over 230 million cells across 21 organisms and 72 tissues, making it the largest publicly available resource of its kind.

Personal highlights

AI-Driven Data Curation: SRAgent autonomously identifies, annotates, and processes single-cell datasets from public repositories like SRA, ensuring scalability and consistency in metadata extraction—eliminating the bottleneck of manual curation.
Standardized Processing with scRecounter: a Nextflow-based pipeline reprocesses raw sequencing data uniformly, reducing batch effects and enabling seamless integration of datasets across studies, platforms, and species.
Flexible Analytical Options: scRecounter generates multiple count matrices (e.g., exonic, intronic, spliced/unspliced) to accommodate diverse research needs, from traditional gene expression analysis to dynamic RNA velocity studies.
Technical Artifact Mitigation: by reprocessing datasets with consistent parameters, scBaseCount minimizes confounding technical variation (e.g., library chemistry, suspension type), preserving biological signals for more accurate AI model training.
Continuous Expansion: the repository is designed to grow dynamically with new data, supporting future integration of additional technologies (e.g., multi-omics, spatial transcriptomics) and protected datasets through community collaboration.

Why should we care?

scBaseCount revolutionizes single-cell research by providing a scalable, standardized foundation for computational biology. For AI researchers, it offers a vast, harmonized dataset to train robust models of cellular behavior. For biologists, it enables meta-analyses free from technical noise, uncovering deeper biological insights. By automating curation and processing, the tool democratizes access to high-quality data, accelerating discoveries in development, disease, and beyond. Its flexibility and transparency (open-source code) invite community-driven improvements, ensuring it remains at the forefront of single-cell genomics.

spCLUE: A Unified Contrastive Learning Framework for Spatial Transcriptomics Analysis

Wang, X., Li, W. V., & Li, H. (2025). spCLUE: a contrastive learning approach to unified spatial transcriptomics analysis across single-slice and multi-slice data. Genome Biology, 26, 177. https://doi.org/10.1186/s13059-025-03636-0

The paper in one sentence

spCLUE is a novel contrastive learning framework that integrates multi-view graph networks, attention mechanisms, and batch prompting to unify spatial transcriptomics analysis across single-slice and multi-slice datasets, outperforming existing methods in spatial domain identification.

Summary

spCLUE addresses the challenges of analyzing spatially resolved transcriptomics (SRT) data by combining multi-view graph construction, contrastive learning, and batch correction into a single framework. It constructs separate spatial and gene expression graphs, aligns them using instance- and cluster-level contrastive learning, and integrates them via an attention mechanism. For multi-slice data, a batch prompting module removes technical artifacts while preserving biological signals. Evaluated across diverse SRT platforms (10x Visium, Slide-seqV2, Stereo-seq), spCLUE consistently outperformed nine single-slice and seven multi-slice methods in accuracy and robustness, enabling scalable, interpretable spatial domain discovery.
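
To make the instance-level part of this concrete, here is a minimal sketch of a contrastive loss that pulls together the two views of the same spot. It uses an InfoNCE-style formulation, which is a common choice for this kind of alignment; whether spCLUE uses exactly this form is an assumption on my part, and the function names, temperature, and toy embeddings are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def instance_contrastive_loss(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss: spot i in view A should match spot i in view B."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidate spots.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching spot (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical embeddings for three spots from the spatial and expression graphs.
spatial_view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_expr = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same expression embeddings, but assigned to the wrong spots.
shuffled_expr = [aligned_expr[1], aligned_expr[2], aligned_expr[0]]

loss_aligned = instance_contrastive_loss(spatial_view, aligned_expr)
loss_shuffled = instance_contrastive_loss(spatial_view, shuffled_expr)
print(loss_aligned < loss_shuffled)
```

The loss is lower when each spot's two embeddings agree, which is the signal that pushes the spatial and expression views toward a consistent representation during training.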

Personal highlights

Multi-view graph learning: spCLUE constructs separate spatial and gene expression graphs, capturing complementary signals without ad hoc fusion, unlike methods that oversimplify spatial-transcriptional relationships.
Dual contrastive learning: combines instance-level alignment (spot consistency across views) and cluster-level separation (distinct domain formation) to enhance biological coherence and clustering signals simultaneously.
Batch-aware integration: the batch prompting module explicitly models and removes technical variation during training, enabling seamless multi-slice integration without sacrificing biological relevance.
Attention-driven fusion: dynamically weights spatial and expression embeddings via learned attention scores, outperforming single-view or fixed-weight approaches in domain identification.
Platform-agnostic performance: achieves state-of-the-art results across six datasets (DLPFC, BRCA, BARISTA, etc.) and diverse technologies, demonstrating versatility without requiring aligned coordinates or matched protocols.
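
The attention-driven fusion idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not spCLUE's implementation: the `query` vector stands in for whatever learned attention parameters the model uses, and the embeddings are toy values.

```python
import math

def softmax(scores):
    """Softmax over a short list of scores, computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(spatial_emb, expr_emb, query):
    """Fuse two per-spot embeddings with per-spot attention weights.

    `query` is a hypothetical stand-in for learned attention parameters.
    Returns the fused embeddings and the (spatial, expression) weights per spot.
    """
    fused, weights = [], []
    for s, e in zip(spatial_emb, expr_emb):
        score_s = sum(q * x for q, x in zip(query, s))
        score_e = sum(q * x for q, x in zip(query, e))
        w_s, w_e = softmax([score_s, score_e])
        fused.append([w_s * a + w_e * b for a, b in zip(s, e)])
        weights.append((w_s, w_e))
    return fused, weights

# Two toy spots: the views disagree for the first and agree for the second.
spatial_emb = [[1.0, 0.0], [0.2, 0.8]]
expr_emb = [[0.0, 1.0], [0.2, 0.8]]
query = [2.0, 0.0]

fused, weights = attention_fuse(spatial_emb, expr_emb, query)
for spot, (w_s, w_e) in enumerate(weights):
    print(f"spot {spot}: spatial weight {w_s:.2f}, expression weight {w_e:.2f}")
```

The point of per-spot weights, as opposed to a fixed mixing ratio, is that the model can lean on whichever view is more informative for a given spot.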

Why should we care?

spCLUE bridges a critical gap in spatial transcriptomics by providing a unified, scalable solution for both single-slice and multi-slice analysis. For biologists, it offers accurate spatial domain identification with minimal technical bias, revealing tissue organization and disease microenvironments more reliably. For computational researchers, its modular design (open-source code) enables extensions to multi-omics data or histology integration.

Corgi: A Context-Aware Sequence-to-Activity Model for Human Gene Regulation

Aksu, E. D., & Vingron, M. (2025). Context-aware sequence-to-activity model of human gene regulation. bioRxiv. https://doi.org/10.1101/2025.06.25.661447

The paper in one sentence

Corgi is a novel context-aware sequence-to-activity model that integrates DNA sequence and trans-regulator expression to predict genome-wide gene expression, chromatin accessibility, and epigenetic marks across unseen cell types with unprecedented accuracy.

Summary

Corgi overcomes a key limitation of traditional sequence-to-activity models—their inability to generalize beyond training cell types—by incorporating trans-regulator expression (e.g., transcription factors, chromatin modifiers) as a biological context vector. Using a hybrid convolutional-transformer architecture with feature-wise linear modulation (FiLM), Corgi dynamically integrates sequence features (cis-regulatory elements) and trans-regulator activity to predict 16 genomic assays (e.g., RNA-seq, ATAC-seq, histone marks) at 64 bp resolution. Trained on 580 diverse human samples, Corgi achieves experimental-level accuracy in cross-cell type predictions, outperforming state-of-the-art models like Borzoi and EpiGePT in challenging "cross-both" benchmarks (unseen sequences + unseen cell types).
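
For anyone unfamiliar with FiLM, the core operation is small enough to sketch. The snippet below is a toy illustration, not Corgi's code: the dimensions, the random weights, and the single linear layer that turns trans-regulator expression into a scale and shift are all assumptions made for the example. The idea is that a context vector produces a per-channel scale (gamma) and shift (beta) that modulate the sequence-derived features.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation.

    features: list of sequence positions, each a list of channel values.
    context:  trans-regulator expression vector for one cell type
              (hypothetical input for this sketch).
    Returns gamma * features + beta, applied per channel at every position.
    """
    gamma = linear(context, w_gamma, b_gamma)  # per-channel scale
    beta = linear(context, w_beta, b_beta)     # per-channel shift
    return [[g * f + b for f, g, b in zip(position, gamma, beta)]
            for position in features]

# Toy setup: 3 sequence positions, 2 channels, 4 trans-regulators (all made up).
rng = random.Random(0)
n_channels, n_regulators = 2, 4
features = [[rng.gauss(0, 1) for _ in range(n_channels)] for _ in range(3)]
w_gamma = [[rng.gauss(0, 0.1) for _ in range(n_regulators)] for _ in range(n_channels)]
w_beta = [[rng.gauss(0, 0.1) for _ in range(n_regulators)] for _ in range(n_channels)]
b_gamma = [1.0] * n_channels  # bias of 1 so a zero context leaves features unchanged
b_beta = [0.0] * n_channels

cell_a = [2.0, 0.0, 1.0, 0.5]  # hypothetical expression profile
cell_b = [0.0, 3.0, 0.2, 1.5]  # a different hypothetical cell type

out_a = film(features, cell_a, w_gamma, b_gamma, w_beta, b_beta)
out_b = film(features, cell_b, w_gamma, b_gamma, w_beta, b_beta)
```

The same sequence features yield different outputs for `cell_a` and `cell_b`, which is the mechanism that lets a single model produce cell-type-specific predictions from one DNA sequence.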

Personal highlights

Biologically inspired architecture: Corgi mirrors cellular gene regulation by combining cis-sequence features (convolutional layers) with trans-regulator context (MLP-processed expression), integrated via FiLM layers—akin to "affinity × concentration" biophysical models.
Generalization to unseen contexts: uniquely predicts epigenetic and transcriptional signals in new cell types (e.g., liver, stem cells) by leveraging trans-regulator expression as a universal cell-state representation, achieving Pearson’s *r* > 0.8 for DNase-seq and RNA-seq.
Multi-assay prediction: simultaneously models 16 assays (e.g., H3K27ac, DNA methylation, CAGE-seq) at 64 bp resolution, enabling imputation of missing data (e.g., ChIP-seq from RNA-seq alone).
Robust benchmarking: outperforms existing models in stringent "cross-both" evaluations (unseen sequences + cell types), with DNA methylation predictions nearing perfection (*r* = 0.93).
Experimental flexibility: enables in silico perturbations of trans-regulators (e.g., CRISPR knockouts) by modulating their expression inputs, bridging computational and wet-lab research.

Why should we care?

Corgi shifts the paradigm in regulatory genomics from static sequence-based predictions to dynamic, context-aware modeling. For biologists, it offers a virtual lab to simulate how changes in TF expression or DNA sequences alter gene regulation—accelerating hypothesis generation for disease mechanisms or developmental processes. For computational researchers, its modular FiLM-based design sets a blueprint for integrating multi-modal data (e.g., spatial transcriptomics, single-cell profiles). By democratizing access to accurate in silico experiments, Corgi reduces reliance on costly epigenomic assays, particularly for rare cell types or patient samples. Its success underscores the power of embedding biological principles—like trans-regulator interplay—into AI architectures.

Figure 1 from Aksu and Vingron - Overview of Corgi

Other papers that piqued my interest and were added to the purgatory of my “to read” pile

Innate immunity and the NF-κB pathway control prostate stem cell plasticity, reprogramming and tumor initiation: https://www.nature.com/articles/s43018-025-00994-3
MiTo: tracing the phenotypic evolution of somatic cell lineages via mitochondrial single-cell multi-omics: https://www.biorxiv.org/content/10.1101/2025.06.17.660165v1
Comparison of spatial transcriptomics technologies using tumor cryosections: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03624-4
A signaling molecule from intratumor bacteria promotes trastuzumab resistance in breast cancer cells: https://www.pnas.org/doi/10.1073/pnas.2421710122
Biological Reasoning with Reinforcement Learning through Natural Language Enables Generalizable Zero-Shot Cell Type Annotations: https://www.biorxiv.org/content/10.1101/2025.06.17.659642v1
DeepSeq: High–Throughput Single–Cell RNA Sequencing Data Labeling via Web Search–Augmented Agentic Generative AI Foundation Models: https://www.biorxiv.org/content/10.1101/2025.06.17.660107v1
In vivo CAR T cell generation to treat cancer and autoimmune disease: https://www.science.org/doi/10.1126/science.ads8473
spCLUE: a contrastive learning approach to unified spatial transcriptomics analysis across single-slice and multi-slice data: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03636-0
Cholinergic neuronal activity promotes diffuse midline glioma growth through muscarinic signaling: https://www.cell.com/cell/fulltext/S0092-8674%2825%2900618-X
AAnet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity: https://aacrjournals.org/cancerdiscovery/article/doi/10.1158/2159-8290.CD-24-0684/763140/AAnet-resolves-a-continuum-of-spatially-localized
Divergent Evolution of Malignant Subclones Maintains a Balance Between Induced Aggressiveness and Intrinsic Drug Resistance in T Cell Cancer: https://aacrjournals.org/cancerdiscovery/article/doi/10.1158/2159-8290.CD-24-1856/762974/Divergent-Evolution-of-Malignant-Subclones
Nerve-to-cancer transfer of mitochondria during cancer metastasis: https://www.nature.com/articles/s41586-025-09176-8
A spatial atlas of chemoradiation therapy in pancreatic cancer identifies cellular and microenvironmental determinants of persister populations: https://www.biorxiv.org/content/10.1101/2025.06.20.660757v1
Mapping and reprogramming microenvironment-induced cell states in human disease using generative AI: https://www.biorxiv.org/content/10.1101/2025.06.24.661094v1
Quantifying batch effects for individual genes in single-cell data: https://www.researchsquare.com/article/rs-4867545/v1
Extrachromosomal DNA replication and maintenance couple with DNA damage pathway in tumors: https://www.cell.com/cell/fulltext/S0092-8674(25)00414-3?rss=yes

Thanks for reading.

Cheers,

Seb.

Sebcentrism
