Latest bioRxiv papers
Category: bioinformatics — Showing 50 items
Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease
Rafiee, M.; Abaj, F.; Mahdevar, M.; Rashidian, A.; Ghaedi, K.; Ghiasvand, R.
Abstract
Inflammation and oxidative stress (OS) are key to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis to identify OS- and inflammation-related hub genes persistently dysregulated in PD and to evaluate their response to nutrigenomic interventions using publicly available datasets. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS- and inflammation-related gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS-inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. The gene-FBC analysis revealed opposing transcriptional patterns, with several FBCs suppressing stress-related genes and upregulating dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS-inflammation gene networks in PD and identifies candidate diet-gene interactions that warrant further experimental validation.
bioinformatics 2026-05-09 v1
A Fractal-Dimension Framework for Quantifying Self-Similarity in Chromatin Folding
El-Yaagoubi, A.; Balubaid, A. O.; Chung, M. K.; Tegner, J.; Ombao, H.
Abstract
The three-dimensional folding of DNA is essential for genome function, but its organization remains difficult to summarize quantitatively across genomic scales. Here, we study DNA folding from Hi-C contact data using a network-based notion of fractal dimension. In this representation, genomic loci are treated as nodes, and observed Hi-C contacts define weighted edges, so that frequently interacting loci are closer in the resulting network. We then estimate fractal dimension using two complementary graph-based methods: the correlation dimension and the sandbox dimension. Validation on synthetic networks shows that the proposed estimators detect clear scaling behavior in hierarchical fractal-like networks, while distinguishing them from networks with local clustering but no stable multiscale self-similarity. Applied to intrachromosomal Hi-C data from the IMR90 human cell line, the method reveals approximate linear scaling regimes on log-log plots, suggesting fractal-like organization in chromatin contact networks. At the chromosome level, estimated fractal dimension tends to increase with chromosome size: larger chromosomes often have dimensions closer to 3, consistent with more compact and space-filling organization, whereas shorter chromosomes tend to have lower dimensions, closer to 1, consistent with simpler and more open folding patterns. A sliding-window analysis at 5 kb resolution further shows that fractal organization varies substantially along chromosomes rather than remaining uniform across genomic position. These results suggest that graph-based fractal dimension provides an interpretable summary of DNA folding complexity at both global and local scales. More broadly, the proposed framework offers a quantitative way to study multiscale genome organization from Hi-C data using tools from network geometry.
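The correlation-dimension idea in this abstract can be illustrated on toy graphs. The sketch below is not the authors' estimator: it assumes unweighted hop distances (the abstract uses Hi-C-weighted edges), and the helper names `correlation_dimension`, `path_graph`, and `grid_graph` are hypothetical. It counts node pairs within r hops and fits the slope of log C(r) against log r.

```python
import math
from collections import deque


def bfs_distances(adj, source, max_r):
    """Hop distances from source to every node reachable within max_r hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_r:
            continue  # do not expand beyond the largest radius of interest
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def correlation_dimension(adj, max_r):
    """Least-squares slope of log C(r) versus log r for r = 1..max_r,
    where C(r) counts ordered node pairs at hop distance between 1 and r."""
    counts = [0] * (max_r + 1)  # counts[d] = ordered pairs at distance exactly d
    for source in adj:
        for d in bfs_distances(adj, source, max_r).values():
            if d > 0:
                counts[d] += 1
    xs, ys, cumulative = [], [], 0
    for r in range(1, max_r + 1):
        cumulative += counts[r]
        xs.append(math.log(r))
        ys.append(math.log(cumulative))
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def path_graph(n):
    """Chain of n nodes (an intrinsically one-dimensional toy graph)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def grid_graph(n):
    """n x n square lattice (an intrinsically two-dimensional toy graph)."""
    adj = {}
    for x in range(n):
        for y in range(n):
            nbrs = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            adj[(x, y)] = [(a, b) for a, b in nbrs if 0 <= a < n and 0 <= b < n]
    return adj


path_dim = correlation_dimension(path_graph(200), max_r=10)
grid_dim = correlation_dimension(grid_graph(30), max_r=5)
print(f"path: {path_dim:.2f}, grid: {grid_dim:.2f}")
```

On these toy graphs the chain gives a slope close to 1 and the lattice a clearly larger one; small radii and boundary effects keep the lattice estimate below 2, which is one reason real analyses look for a stable linear regime on the log-log plot.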
bioinformatics 2026-05-09 v1
Building an open ecosystem for molecular neuroimaging: standards and tools from the OpenNeuroPET initiative
Ganz, M.; Norgaard, M.; Pernet, C.; Matheson, G. J.; Galassi, A.; Ceballos, E. G.; Wighton, P.; Bilgel, M.; Eierud, C.; Gonzalez-Escamilla, G.; Buckholtz, J.; Blair, R.; Markiewicz, C. J.; Hardcastle, N.; Greve, D. N.; Thomas, A. G.; Poldrack, R. A.; Calhoun, V. D.; Innis, R. B.; Knudsen, G. M.
Abstract
Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.
bioinformatics 2026-05-09 v1
A structural grammar of truncation across the human homodimer landscape
Karagöl, T.; Karagöl, A.
Abstract
Alternative splicing and proteolytic truncation generate tens of thousands of protein isoforms in the human proteome, but the structural consequences for quaternary state, the level at which most signaling, enzymatic and regulatory function operates, have largely been examined one molecule at a time. Leveraging the recent expansion of the AlphaFold Database to predicted human homodimers, we systematically compared 5,168 canonical-versus-truncated homodimer pairs across the human proteome. In high-confidence canonical homodimers, truncation is associated with predicted structural conservation in 56.4% of pairs (mean 85 residues lost), complete interface ablation in 26.1% (mean 178 residues lost), and partial destabilization in 17.5% (mean 134 residues lost); a distinct fourth class (4.0% of the dataset, n = 208) shows truncation-associated emergence of a predicted high-confidence interface from a sub-threshold canonical baseline. Two reproducible rules govern these transitions: a topological asymmetry in which N-terminal losses are preferentially enriched ~1.6-fold in interface preservation while C-terminal losses are rare overall (~6% of pairs) and modestly under-represented in the conservation class, and a biophysical rule in which emergence-class proteins show substantially elevated intrinsic disorder content relative to ablation-class proteins, as measured by both AlphaFold pLDDT-defined disorder of the canonical structure (Cohen's d ≈ 1.39) and AIUPred peak binding propensity of the truncated isoform (Cohen's d ≈ 0.65). Formal pathway enrichment recovered only a small nucleotide-metabolism signal, indicating that these rules operate across diverse gene-functional categories. Truncation-associated remodeling of homodimer architecture thus constitutes a structural grammar of the human proteome rather than a specialty of any single regulatory family.
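For readers unfamiliar with the effect size quoted here, Cohen's d is the difference between two group means divided by a standard deviation. The sketch below uses the common pooled-standard-deviation form; the abstract does not say which variant the authors used, so that choice, the function name `cohens_d`, and the toy numbers are assumptions.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


# Toy values only: means 4 and 3, both sample variances 4, so d = 1 / 2.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))  # 0.5
```

By the usual rule of thumb, d near 0.5 is a medium effect and d above 0.8 a large one, so the reported values of about 0.65 and 1.39 would be medium and large, respectively.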
bioinformatics 2026-05-09 v1
Machine learning cross-platform proteomic imputation enables protein quality scoring and replication of epidemiological associations
Li, L.; Alaa, A.; Tan, Y.; Demirel, I.; Friedman, S.; Zha, Q.; Trac, R. P.; Taylor, K. D.; Yu, B.; Ballantyne, C. M.; Deo, R.; Dubin, R.; Tsai, M. Y.; Peloso, G. M.; Brody, J.; Austin, T.; Psaty, B. M.; Nicholas, J.; Raffield, L. M.; Tahir, U.; Coresh, J.; Hornsby, W.; Chan, A.; Rich, S. S.; Rotter, J. I.; Ganz, P.; Gerszten, R.; Philippakis, A.; Natarajan, P.; Yu, Z.
Abstract
High-throughput affinity-based proteomics has advanced biomedical research, yet fundamental, persistent discordance between mainstream platforms (SomaScan and Olink) routinely undermines the replication of findings. This platform-driven non-replication complicates downstream biological validation and biomarker prioritization. Here, we develop a machine learning-based framework for cross-platform protein value imputation to resolve this translational bottleneck. Using paired proteomic data measured by both SomaScan and Olink from 5,325 participants of the Multi-Ethnic Study of Atherosclerosis, we developed models to impute cross-platform measurements and applied them to two independent and demographically distinct cohorts (Cardiovascular Health Study [N=3,171] and UK Biobank [UKB; N=41,405]) for external validation. Our bi-directional model 1) established an imputation performance-based protein fidelity index, validated against gold-standard measurements from Atherosclerosis Risk in Communities study (N=101) and Nurses' Health Study (N=54), 2) enabled imputation of platform-exclusive protein measurements, and 3) facilitated calibration of overlapping proteins. We demonstrate the utility of this framework through three applications: 1) fidelity-informed analyses enhanced the replication of biomarker discovery, 2) recovery of SomaScan signals that were previously inaccessible in UKB's original Olink measurements, and 3) improved replication performance for overlapping proteins. Our study offers a translational roadmap that allows researchers to achieve reliable epidemiological replication, target specific assays for future optimization, and prioritize biological signal over platform noise.
bioinformatics 2026-05-09 v1
A novel phylogenomics pipeline reveals extensive topological conflict in the evolution of the angiosperm order Cucurbitales
Ortiz, E. M.; Hoewener, A.; Shigita, G.; Raza, M.; Maurin, O.; Zuntini, A.; Forest, F.; Baker, W. J.; Schaefer, H.
Abstract
High-throughput sequencing data, such as target capture, RNA-Seq, genome skimming, and high-depth whole genome sequencing, are used for phylogenomic analyses. Integrating these mixed data types into a single phylogenomic dataset requires several bioinformatic tools and significant computational resources. Here, we present Captus, a novel pipeline to analyze mixed data efficiently. Captus assembles these data types, searches for loci of interest, and produces paralog-filtered alignments. If reference target loci are not available for the studied taxon, Captus can also be used to discover new putative homologs via sequence clustering. Compared to other software, Captus allows the recovery of a greater number of more complete loci across more species. We apply Captus to assemble a comprehensive dataset, comprising the four types of sequencing data for the angiosperm order Cucurbitales, a clade of about 3,100 species in eight mainly tropical plant families, including begonias (Begoniaceae) and gourds (Cucurbitaceae). Our phylogenomic results support the currently accepted circumscription of Cucurbitales except for the position of the holoparasitic Apodanthaceae, which group with Rafflesiaceae in Malpighiales. A subset of mitochondrial gene regions supports the earlier divergence of Apodanthaceae in Cucurbitales. However, the nuclear regions and majority of mitochondrial regions place Apodanthaceae in Malpighiales. Within Cucurbitaceae, we confirm the monophyly of all currently accepted tribes but also reveal hybridization and incomplete lineage sorting both in Cucurbitales and within Cucurbitaceae. We show that contradicting results among earlier phylogenetic studies in Cucurbitales can be reconciled when accounting for gene tree conflict and demonstrate the efficiency of Captus for complex datasets.
bioinformatics 2026-05-08 v4
RiboPipe: efficient per-transcript codon-resolution ribo-seq coverage imputation for low-coverage transcripts
Zhang, Y.-z.; Hashimoto, S.; Li, S.; Inada, T.; Imoto, S.
Abstract
Motivation: Ribosome profiling (Ribo-seq) provides codon-resolution measurements of translation; however, many transcripts exhibit sparse or low read coverage, which limits downstream quantitative analyses. Reliable prediction and imputation of codon-resolution coverage for low-coverage transcripts remain computationally challenging. Results: We present RiboPipe, an efficient framework for per-transcript codon-resolution Ribo-seq coverage imputation for low-coverage transcripts. RiboPipe is designed around three key principles. First, it jointly optimizes transcript-level mean ribosome load (MRL) prediction and codon-level coverage modeling within a unified objective, enabling consistent learning across both local and transcript-level scales. Second, it introduces a peak-weighted loss that emphasizes high-signal codon positions associated with translational pausing, improving the recovery of functionally relevant coverage peaks. Third, the framework is lightweight and data-efficient, achieving stable performance even when trained on only a small fraction of high-coverage transcripts. Using two publicly available Ribo-seq datasets (GSE233886 and GSE133393), we demonstrate stable convergence and consistent prediction accuracy across multiple train-test split ratios. Comparative evaluation of embedding strategies shows that simple one-hot representations achieve competitive or even superior performance compared with pre-trained language model embeddings under identical training conditions. Overall, RiboPipe provides a computationally efficient and scalable framework for Ribo-seq coverage imputation in low-coverage transcripts. Availability and Implementation: The source code and associated data can be accessed at https://github.com/yaozhong/riboPipe
bioinformatics 2026-05-08 v2
Predicting Enzyme pH Optima from Structure Using Equivariant Graph Neural Networks
SinhaRoy, R.; Clauss, C.; Ivanikov, I.; Kuenze, G.
Abstract
Enzyme activity and stability are strongly modulated by pH, making the catalytic pH optimum (pHopt) a key parameter in enzyme development and biotechnological applications. Experimental determination of pHopt is, however, labor-intensive and time-consuming, motivating the development of accurate computational prediction methods. Here, we introduce pHoptNN, an E(n)-equivariant graph neural network designed to predict enzyme pHopt directly from three-dimensional protein structures. pHoptNN was trained on a curated dataset comprising nearly 12,000 enzymes with experimentally determined pHopt values and high-confidence structural models obtained from the Protein Data Bank and AlphaFold3. The model represents enzymes as atomic-level molecular graphs, integrating structural, chemical, and electrostatic features. Model development was guided by extensive hyperparameter optimization using genetic and Bayesian search strategies. On a held-out test set, pHoptNN achieved a root-mean-square error (RMSE) of 0.588 pH units, substantially outperforming the sequence-based method EpHod (RMSE = 0.879). Moreover, pHoptNN maintains robust predictive performance across different enzyme classes and pH ranges. These results demonstrate the utility of structure-based equivariant deep learning for enzyme pHopt prediction and highlight the potential of pHoptNN to accelerate enzyme discovery and engineering workflows.
bioinformatics 2026-05-08 v2
scStudio: A User-Friendly Web Application Empowering Non-Computational Users with Intuitive scRNA-seq Data Analysis
Bica, M.; Serre, K.; Barbosa-Morais, N. L.
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity by providing detailed insights into gene expression at the individual cell level. Despite its potential, the complexity of scRNA-seq data analysis often poses challenges for researchers without computational expertise. Findings: To address this, we developed scStudio, a user-friendly, comprehensive, and modular web-based application designed to democratize scRNA-seq data analysis. scStudio is equipped with a suite of features designed to streamline data retrieval and analysis with both flexibility and ease, including automated dataset retrieval from the Gene Expression Omnibus. Users can also upload their own datasets in a variety of formats, integrate multiple datasets, and tailor their analyses using a wide range of flexible methods with options for parameter optimization. The application supports all the essential steps required for scRNA-seq data analysis, including in-depth quality control, normalization, dimensionality reduction, clustering, differential expression, and functional enrichment analysis. scStudio also tracks the history of analyses, supports session data storage and export, and facilitates collaboration through data sharing features. Conclusion: By developing scStudio with a user-friendly interface and a scalable architecture, we address the evolving needs of scRNA-seq research, making advanced data analysis accessible and manageable while accommodating future developments in the field. scStudio is freely available at https://compbio.imm.medicina.ulisboa.pt/app/scStudio.
bioinformatics 2026-05-08 v2
SPPIDER-seq: Sequence-based partner-aware predictor of protein-protein interaction sites
Porollo, A.; Jadhav, O.; Alvarez, A.; Chen, J.
Abstract
Motivation: Sequence-based protein-protein interaction (PPI) site predictors typically analyze proteins in isolation, neglecting partner-specific context that is critical for interface specificity, particularly in transient and disordered interactions. Results: We introduce SPPIDER-seq, a partner-aware PPI site prediction framework that combines pretrained ESM-2 embeddings with a cross-attention architecture to enable residue-level conditioning on interacting partners. We curated non-redundant protein-peptide interaction datasets from BioLiP and used them to train and benchmark two complementary models: a receptor-centric model optimized for structured interfaces and a peptide-centric model tailored to disordered, motif-driven binding. On blind benchmarks, SPPIDER-seq achieved AUROC values up to 0.797 and MCC values up to 0.269, outperforming AlphaFold3 on peptide-mediated and disordered interfaces while remaining complementary on globular complexes. Application to 341 TP53 interaction partners revealed coherent, partner-specific interface patterns across both structured and intrinsically disordered regions. Availability and Implementation: SPPIDER-seq models, datasets, and the Python code are freely available on the web at: https://github.com/aporollo-lab/SPPIDER-seq
bioinformatics 2026-05-08 v2
Simple baselines rival protein language models in mutation-dense design of function tasks
Talpir, I.; Fleishman, S. J.
Abstract
Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Although protein language models (pLMs) have been used in zero-shot and transfer-learning design studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute significantly to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.
bioinformatics 2026-05-08 v2
Sample-level modeling of single-cell data at scale with tinydenseR
Milanez-Almeida, P.; Schildknecht, D.; Linder, M.; Brachmann, S. M.; Weiss, A.; Adler, F.; Lenticchia, S. C.; Meistertzheim, M.; Wild, S.; Cuttat, R.; Jayaraman, P.; Lee, L. H.; Mulvey, T.; Hassounah, N.; Crafts, G.; Quinn, D. S.; Orlando, E. J.
Abstract
Single-cell studies now routinely encompass hundreds of samples and millions of cells, offering unprecedented opportunities to link sample-level phenotypes with cellular and molecular states. However, current workflows often depend on cell-level inference and rigid clustering, which can distort significance and obscure subtle, continuous variation, in particular for complex experimental designs. Here, we present tinydenseR, a clustering-independent framework that enables robust, scalable, and statistically sensitive detection of differential cell states, outperforming existing workflows in speed, memory usage, and biological resolution. Technology-agnostic at its core, tinydenseR works seamlessly on scRNA-seq, flow, mass and spectral cytometry. Across synthetic benchmarks, a preclinical xenograft model, two immuno-oncology trials and a multi-study atlas, tinydenseR uncovers disease and treatment history-associated effects, including subtle within-cluster heterogeneity. Designed to accelerate discovery in clinical, preclinical, and translational research, the open-source package is available at GitHub.com/Novartis/tinydenseR.
bioinformatics 2026-05-08 v2
Denoised MDS-UPDRS Part-III Scores Yield New Patterns of Progression Heterogeneity in Early Stage Parkinson's Disease
Koss, J.; Tinaz, S.; Tagare, H.
Abstract
Parkinson's Disease (PD) Motor Scores (MDS-UPDRS Part III) are quite noisy. This paper proposes a new methodology for processing these scores by first denoising the scores to enhance the underlying progression signal, and then conducting a high-dimensional analysis which does not sum the scores into a total movement score. The analysis gives novel insights into PD progression heterogeneity: it reveals that the heterogeneity is continuously variable rather than clustered into "subtypes" and that the variability is along two easily understood axes. This analysis also resolves some of the discrepancies in previously reported progression subtypes. Finally, the analysis reveals that patient-specific progression cannot be predicted from baseline using only MDS-UPDRS Part III scores.
bioinformatics 2026-05-08 v1
QuadStack: Specialized convolutional blocks enable in vivo BG4-binding motif prediction and highlight discrepancies with in vitro G-quadruplexes.
Ulas, P. N.; Doluca, O.
Abstract
G-quadruplex (G4) prediction has been largely guided by in vitro biophysical rules, yet these models show limited agreement with in vivo measurements. Here, we present QuadStack, a deep learning model trained on a multi-study BG4-ChIP-seq compendium. QuadStack introduces two biologically grounded convolutional modules: G4Stack Convolution, which captures G/C stacking patterns, and Reverse Complement Convolution, which enforces strand-invariant representations consistent with ChIP-seq signals. QuadStack achieves strong predictive performance (AUC up to 0.94) and substantially outperforms widely used in vitro-based predictors on genomic test data. Beyond performance, our analyses reveal that BG4-associated sequence grammar is not solely governed by canonical isolated G-rich tracts, but also by patterns where G and C nucleotides are mixed. This suggests that cytosines are not simply disruptive in vivo, and raises the possibility that cytosines may play a context-dependent role or that guanines on the opposite strand contribute to the structure, which could explain the difference between in vivo and in vitro observations. Together, these findings demonstrate a fundamental discrepancy between in vitro folding propensity and in vivo G4 biology, and establish QuadStack as both a predictive model and a framework for interpreting G4 formation in its native genomic context.
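The strand-invariance idea behind a reverse-complement convolution can be shown with a single hand-written filter. This is only a sketch of the general principle, not QuadStack's module: the functions `reverse_complement`, `scan`, and `strand_invariant_score` and the toy GGG filter are hypothetical. The same filter is applied to a sequence and to its reverse complement, and the maximum over both scans is kept, so either strand gives the same score.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(seq, weights):
    """Slide a position-specific filter along seq and score every offset.

    weights is a list of dicts, one per filter position, mapping base -> weight.
    """
    k = len(weights)
    return [
        sum(weights[i][seq[p + i]] for i in range(k))
        for p in range(len(seq) - k + 1)
    ]


def strand_invariant_score(seq, weights):
    """Best filter response over both strands (identical for seq and its
    reverse complement, because the two scans simply swap roles)."""
    forward = scan(seq, weights)
    reverse = scan(reverse_complement(seq), weights)
    return max(forward + reverse)


# Toy filter that counts G at each of three consecutive positions.
ggg_filter = [{"A": 0, "C": 0, "G": 1, "T": 0}] * 3

# "ACCCTA" has no G run itself, but its reverse complement "TAGGGT" does.
print(strand_invariant_score("ACCCTA", ggg_filter))  # 3
print(strand_invariant_score("TAGGGT", ggg_filter))  # 3
```

A learned model would replace the fixed filter with trainable weights; the point here is only that sharing one filter across both strands makes a C-rich tract on one strand score like the G-rich tract it pairs with.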
bioinformatics 2026-05-08 v1
Allosteric Protein Chemical Shift Perturbations are Ubiquitous
Benavides, T. L.; Ramelot, T. A.; Montelione, G. T.
Abstract
While allosteric protein function has been appreciated for decades, the ubiquity of conformational shifts, particularly those distant from the interaction interface, has not been broadly characterized. For example, ligand binding frequently triggers allosteric effects far from the interaction interface, yet the prevalence of these conformational shifts underpinning protein function remains poorly documented. We systematically assessed the generality of allosteric effects as monitored by NMR Chemical Shift Perturbations (CSPs) distant from the interaction interface. In a set of 139 protein-protein complexes, a striking 74% of all significant CSPs are non-local to the binding site. Notably, more than 35% of significant CSPs outside the binding site occur in residues for which the shortest receptor-ligand interatomic distance is more than 10 Å. Every protein analyzed exhibits a significant fraction (> 8%) of CSPs distant from the binding site. This analysis across a large number of protein structures demonstrates and documents that structural plasticity is a ubiquitous and fundamental property of proteins.
bioinformatics 2026-05-08 v1
LongAllele: a joint inference framework for allele-specific analysis on long-read bulk and single-cell RNA sequencing
Xu, Z.; Wang, K.
Abstract
Allele-specific analysis from RNA-seq is a powerful approach to characterize cis-regulatory effects. However, existing methods remain limited in both haplotype inference and allelic testing. Their haplotype-inference workflows separate variant calling, haplotype phasing, and read-haplotype assignment into sequential steps, failing to fully exploit within-read SNV linkage information and propagating errors into downstream allelic analysis. At the testing stage, they ignore non-phasable reads lacking heterozygous SNVs, biasing calls and inflating false positives, and remain incomplete across gene-, isoform-, and local-event-level variant effects. Here, we present LongAllele, a statistical framework that employs an expectation-maximization algorithm to jointly infer heterozygous variants, haplotype structure, and read-haplotype assignments from long-read bulk and single-cell RNA sequencing. LongAllele further introduces phasability-aware testing that explicitly accounts for non-phasable reads, avoiding inflated false-positive calls when haplotype information is incomplete. It also enables comprehensive allelic testing across gene-level ASE, isoform-level allele-specific transcript usage (ASTU), and local-event-level haplotype-associated exon and junction usage (HAEU and HAJU), providing a multi-scale view of cis-regulation. We applied LongAllele to long-read RNA-seq datasets spanning GTEx (multi-tissue bulk), peripheral blood mononuclear cells (single-cell), and human hippocampus (single-nucleus). LongAllele consistently revealed greater tissue and cell-type variability in expression-level than isoform-level allelic regulation, pinpointed high-impact regulatory variants including rare splice-site mutations missed by standalone variant callers, and showed that purifying selection constrains allelic imbalance at both gene and isoform levels. LongAllele offers a unified framework for haplotype-resolved cis-regulatory analysis across diverse cellular contexts.
bioinformatics 2026-05-08 v1
Efficient Stochastic Trace Generation for Transcription
Ferdowsi, A.; Fuegger, M.; Nowak, T.
Abstract
Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
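The exact baseline mentioned in this abstract, Gillespie's algorithm applied to a two-state promoter, can be sketched in a few lines. This is a generic textbook version, not the bcrnnoise API; the function name `gillespie_telegraph`, the rate-parameter names, and the example rates are assumptions chosen for illustration.

```python
import random


def gillespie_telegraph(k_on, k_off, k_tx, k_deg, t_end, seed=0):
    """Exact stochastic simulation of a two-state (telegraph) promoter.

    Reactions: OFF -> ON (k_on), ON -> OFF (k_off),
    ON -> ON + mRNA (k_tx), mRNA -> nothing (k_deg per molecule).
    Returns event times and the mRNA count after each event.
    """
    rng = random.Random(seed)
    t, promoter_on, mrna = 0.0, 0, 0
    times, counts = [t], [mrna]
    while True:
        rates = [
            k_on * (1 - promoter_on),  # promoter switches on
            k_off * promoter_on,       # promoter switches off
            k_tx * promoter_on,        # one transcript is produced
            k_deg * mrna,              # one transcript is degraded
        ]
        total = sum(rates)
        t += rng.expovariate(total)    # waiting time to the next event
        if t > t_end:
            break
        # Pick which reaction fires, with probability proportional to its rate.
        reaction = rng.choices(range(4), weights=rates)[0]
        if reaction == 0:
            promoter_on = 1
        elif reaction == 1:
            promoter_on = 0
        elif reaction == 2:
            mrna += 1
        else:
            mrna -= 1
        times.append(t)
        counts.append(mrna)
    return times, counts


times, counts = gillespie_telegraph(k_on=0.5, k_off=0.5, k_tx=10.0, k_deg=1.0, t_end=20.0, seed=1)
print(len(times), "events; final mRNA count:", counts[-1])
```

Every event needs its own random waiting time and reaction draw, so the cost grows with the number of reactions fired; that per-event cost is what SDE surrogates with fixed time steps aim to avoid.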
bioinformatics 2026-05-08 v1
SaVanache: indexing and visualizing pangenome variation graphs
Mohamed, M.; Durant, E.; Rouard, M.; Muller, C.; Monat, C.; Conte, M.; Sabot, F.
Abstract
With the rapid increase in genome sequencing and the growing availability of genomic resources, genomics is shifting toward pangenome representations that capture intra- and inter-specific diversity by integrating multiple genomes into a single entity. These pangenomes are increasingly modeled as graphs, encoding complex genomic variations in structures such as de Bruijn or variation graphs. However, while genome browsers provide standard and effective solutions for visualizing single or limited numbers of genomes, equivalent interactive tools for graph-based pangenomes remain limited, particularly for variation graph models. We developed SaVanache, a multi-resolution visualization interface designed to explore pangenome variation graphs at various depths. SaVanache enables the exploration of both global diversity and structural variations (SVs) across genomes relative to a user-defined linear pivot genome. Unlike synteny viewers, SaVanache emphasizes variations by representing SV types through a dedicated set of glyphs, facilitating intuitive one-to-many comparisons. To support smooth exploration, SaVanache preprocesses a Graphical Fragment Assembly (GFA) pangenome file into optimized index and data structures, enabling fast, real-time queries on large pangenome graphs. By combining advanced visualization techniques with efficient data handling, SaVanache provides a robust tool for scientists to analyze and visualize genetic variation within genomes and pangenomes, facilitating the identification of genetic determinants associated with phenotypes of interest and fully exploiting current genomic resources.
bioinformatics 2026-05-08 v1
A Differentiable dFBA Simulator for Scalable Bayesian Inference over Microbial Metabolic Models
Diederen, T.; Merzbacher, C.; Patz, M.
Abstract
Medium optimisation for bioprocess design remains challenging and costly: fermentation recipes typically contain ten or more components, the design space expands combinatorially as ingredients are added, and each batch experiment requires over 24 hours. High-throughput 96-well plate screening can reduce experimental cost, but extracting actionable predictions from growth curves requires a mechanistic model that links medium composition to cellular metabolism. In this paper, we present a differentiable simulator for dynamic flux balance analysis (dFBA) that enables scalable Bayesian inference over microbial metabolic models. A distinguishing feature is that inference is driven entirely by OD600 measurements, a simple optical proxy for biomass, without substrate or product assays; internal fluxes, substrate consumption, and secreted metabolite profiles are recovered as latent variables constrained by the metabolic network stoichiometry. We resolve the core differentiability barrier of classical dFBA by reformulating the per-step linear or quadratic programme (LP/QP) as a smooth continuous ODE (the Relaxed Interior-Point ODE, R-iODE), establishing the mathematical framework for end-to-end gradient propagation through long fermentation trajectories in JAX; full gradient validation is ongoing. The result is a framework for principled inference over thousands of batch fermentations, providing a path toward model-guided medium design, cross-strain parameter transfer, and scale-up prediction from plate data.
bioinformatics2026-05-08v1TopoFuseNet: Hierarchical Graph Representation Learning with Multi-Scale Topological Features for Accurate Drug Synergy Prediction
Wang, Q.; Shi, x.Abstract
Accurate prediction of drug synergy is paramount for developing effective combination therapies and advancing personalized medicine. Although methods based on graph neural networks (GNNs) have become a prevalent approach, they often treat molecules as flat graphs of connected atoms, thus overlooking their inherent hierarchical structure (i.e., atoms forming functional groups) and the critical topological information that governs molecular interactions. To address this limitation, we introduce TopoFuseNet, a novel hierarchical graph representation learning framework that integrates multi-scale topological features. The core innovations of TopoFuseNet include: 1) The first-ever application of "Group Centrality" from network science to cheminformatics, enabling the identification and quantification of functional groups crucial to drug activity; 2) A systematic, multi-path strategy to seamlessly integrate node-level (atom) and group-level (functional group) topological features into a Graph Attention Network (GAT) via feature augmentation, attention biasing, and hierarchical pooling; 3) A Differential Transformer module to deeply fuse multi-modal features learned from sequences, fingerprints, and our proposed hierarchical graph representations.
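To make the idea of scoring a functional group with a group-level centrality concrete, the following minimal sketch computes group degree centrality (the fraction of atoms outside the group that are bonded to at least one group member) on a toy molecular graph. The abstract does not state which group centrality variant TopoFuseNet uses, so the variant, the function name, and the toy molecule are assumptions made purely for illustration.

```python
def group_degree_centrality(adjacency, group):
    """Fraction of non-group nodes adjacent to at least one group member."""
    group = set(group)
    outside = set(adjacency) - group
    if not outside:
        return 0.0
    # Neighbours of any group member, excluding the group itself.
    reached = {nbr for node in group for nbr in adjacency[node]} - group
    return len(reached) / len(outside)


# Toy heavy-atom graph of acetic acid (illustrative labels): C1-C2, C2=O3, C2-O4.
acetic_acid = {
    "C1": {"C2"},
    "C2": {"C1", "O3", "O4"},
    "O3": {"C2"},
    "O4": {"C2"},
}
carboxyl = ["C2", "O3", "O4"]
carboxyl_score = group_degree_centrality(acetic_acid, carboxyl)
single_oxygen_score = group_degree_centrality(acetic_acid, ["O3"])
```

In a model like the one described, such a score would presumably be attached to each functional group as an extra feature; how TopoFuseNet actually does this is not specified in the abstract.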
bioinformatics2026-05-08v1RAPID: an interactive R/Shiny platform for end-to-end 16S rRNA and ITS amplicon sequence analysis using DADA2
Kapoor, B.; Cregger, M. A.; Ranjan, P.Abstract
Motivation: Amplicon sequencing of 16S rRNA and internal transcribed spacer (ITS) gene regions is the most widely used approach for characterizing bacterial and fungal communities, respectively. The DADA2 pipeline has become a standard for inferring amplicon sequence variants (ASVs), offering single-nucleotide resolution over traditional OTU clustering. However, executing the full DADA2 workflow requires proficiency in R programming and manual coordination of multiple sequential steps, presenting a substantial barrier for researchers in clinical, environmental, and agricultural sciences who lack computational training. Results: We present RAPID (R-based Amplicon Pipeline for Interactive DADA2), a pair of R/Shiny applications providing complete graphical user interfaces for 16S rRNA and ITS amplicon sequence analysis. The 16S application implements a 10-step guided workflow from raw paired-end FASTQ files through quality filtering, error learning, dereplication, paired-read merging, chimera removal, taxonomy assignment (SILVA), phyloseq construction with data transformation (rarefaction, relative abundance, or CLR), interactive visualization (rarefaction curves, alpha diversity, NMDS, PCoA, taxonomic abundance), PERMANOVA, and ANCOM-BC2 differential abundance analysis. The ITS application extends this to an 11-step workflow, adding an automated primer removal step using cutadapt with support for multiple primers and length-variable amplicons, and uses the UNITE database for fungal taxonomy. Both applications feature asynchronous background processing, session persistence, real-time progress monitoring, publication-ready figure export, and comprehensive result downloads. Availability: RAPID is freely available at https://github.com/beantkapoor786/RAPID. Both applications can be installed locally on any system with R (version 4.0 or higher) and run as local web applications accessible through a standard browser.
Keywords: 16S rRNA, ITS, amplicon sequencing, DADA2, microbiome, mycobiome, graphical user interface, Shiny, phyloseq, ASV, PERMANOVA, ANCOM-BC2
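One of the data transformations named in the RAPID abstract, the centered log-ratio (CLR), can be sketched in a few lines. RAPID itself is written in R; the Python function below is only an illustration of the transform, and the pseudocount used to handle zero counts is an assumption, not something the abstract specifies.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    # The pseudocount (an assumed choice) keeps zero counts finite under log.
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical ASV counts for one sample.
sample_counts = [120, 30, 0, 850]
transformed = clr(sample_counts)
```

By construction the CLR values of a sample sum to zero, which is a quick sanity check on any implementation.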
bioinformatics2026-05-08v1BART-spatial unravels biologically significant transcriptional regulators from spatial omics data
Wang, J.; Zhang, H.; Wang, Z.; Zang, C.Abstract
Transcriptional regulators (TRs) shape cell fate decisions by activating or repressing lineage-specific genes and integrating environmental signals with intrinsic networks. Identifying functional TRs is essential for understanding development, tissue organization, and disease. Emerging spatial transcriptomics and epigenomics technologies now provide near-single-cell resolution mapping of genomic features while preserving information about each cell's physical location and microenvironment, both of which influence TR activity. Despite these advances, identifying active TRs in spatial data remains challenging because TR expression is low and TR activity often does not correlate directly with mRNA levels. Moreover, existing tools, designed mainly for non-spatial single-cell data, overlook spatial heterogeneity. To bridge this gap, we developed BART-spatial (Binding Analysis for Regulation of Transcription for spatial omics), a computational method to infer functional TRs from spatial omics data. BART-spatial integrates spatial variability and pseudo-temporal information with publicly available TR binding profiles. Applied to multiple spatial datasets from diverse platforms, including 10X Visium, Visium HD, Atera, and spatial RNA-ATAC-seq, BART-spatial consistently outperforms existing methods, identifying stage-specific TRs and revealing regulators undetectable by expression alone. Its compatibility with spatial epigenomics data further strengthens its utility and enables cross-validation. Overall, BART-spatial provides a powerful and robust tool for decoding spatially resolved gene regulatory programs.
bioinformatics2026-05-08v1Metastatic Site Prediction in Breast Cancer using Kirchhoff's Law and Omics Knowledge Graph
Jha, A.; Khan, Y.; Sahay, R.; d'Aquin, M.Abstract
Predicting the anatomical site of metastasis from a primary tumour remains an unsolved problem in breast cancer (BRCA) and metastatic disease more broadly. The difficulty is structural: metastatic biology is multi-site (bone, lung, liver, brain), multi-omics (genomics, proteomics, methylomics, drug response), and multi-modal (CNV, gene expression, DNA methylation, pathways, clinical associations). Existing classifiers either collapse this heterogeneity into a single feature vector or rely on a single omics layer, both of which discard the mechanistic structure that drives metastatic tropism. We introduce Kirchhoff Knowledge Graphs (K-KG), a framework that imports the conservation laws of electrical-circuit theory into knowledge graph reasoning. Our contributions are: (1) a layered RDF Cancer Decision Network integrating 36 polyomics datasets across mutations, pathways, drugs, diseases, and reactions; (2) two novel conservation laws - the Knowledge-Graph Voltage Law (KGVL) and Knowledge-Graph Current Law (KGCL) - that govern information flow during traversal and yield a principled measure of graph completeness; (3) topological motif mining on the conserved graph, replacing expression-based feature selection by identifying triangular sub-structures whose rewiring marks metastatic transition; (4) a Graph Convolutional Neural Network whose hidden layers are the omics layers themselves, predicting site-specific metastasis as a continuous percentage rather than a binary label. On TCGA-BRCA training plus one validation and four independent test cohorts from GEO, K-KG achieves 83.8% AUC for relapse prediction and up to 0.87 AUC / 0.91 F1 for brain-site-specific prediction, outperforming Random Forest, Neural Network, and SVM baselines by 8-20 AUC points. To our knowledge this is the first application of Kirchhoff's laws (1845, 1847) to graph-based machine learning, and the first metastasis predictor that returns a per-site contribution profile rather than a single label.
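The abstract does not define the Knowledge-Graph Current Law, so the sketch below only illustrates the classical principle it borrows from: Kirchhoff's current law, under which the flow entering an interior node equals the flow leaving it. The edge weights, node names, and the choice of which nodes count as terminals are hypothetical.

```python
from collections import defaultdict


def net_flow(edges):
    """Inflow minus outflow per node for (source, target, weight) edges."""
    balance = defaultdict(float)
    for source, target, weight in edges:
        balance[source] -= weight
        balance[target] += weight
    return dict(balance)


def violating_nodes(edges, terminals=(), tol=1e-9):
    """Interior nodes whose inflow and outflow do not balance."""
    return sorted(
        node
        for node, balance in net_flow(edges).items()
        if node not in terminals and abs(balance) > tol
    )


# Hypothetical weighted traversal through a small layered graph.
terminals = {"gene", "drug", "disease"}
balanced = [("gene", "pathway", 2.0), ("pathway", "drug", 1.5), ("pathway", "disease", 0.5)]
leaky = [("gene", "pathway", 2.0), ("pathway", "drug", 1.5), ("pathway", "disease", 0.25)]
balanced_violations = violating_nodes(balanced, terminals)
leaky_violations = violating_nodes(leaky, terminals)
```

A node that fails this balance is one plausible reading of an "incomplete" region of a graph, but whether K-KG measures completeness this way is an assumption.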
bioinformatics2026-05-07v3PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework
Yi, H.; Wang, L.; Chen, X.; Ding, Y.; Carroll, A.; Chang, P.-C.; Shafin, K.; Xu, L.; Zeng, X.; Zhao, X.; Gong, M.; Wei, X.; Hou, Y.; Ni, M.Abstract
Background: Although pangenome references offer richer population diversity compared to linear references, current mainstream pangenome-based variant callers are limited to detecting only known variants stored in the graph. To address this limitation, we developed PanVariants, a novel pipeline designed to detect both known and novel variants accurately. We systematically evaluated its performance against the traditional linear alignment solution (BWA+GATK/Manta) and the existing pangenome-aware solution (DRAGEN/PanGenie) in three contexts: accuracy for small variants (SNVs/indels) and structural variants (SVs) in Genome in a Bottle samples, clinical detection on positive samples, and cohort-based joint calling. Results: By integrating k-mer-based and mapping-based methods, PanVariants significantly reduced variant errors (FPs + FNs), achieving a 73% reduction compared to BWA+GATK and a 45% reduction compared to DRAGEN for SNVs. Retraining the DeepVariant model with high-quality DNBSEQ data further decreased errors by 15%. For SV detection, PanVariants attained an F1-score of 89.39%, markedly outperforming DRAGEN (68.18%) and BWA+Manta (58.33%), approaching long-read sequencing performance (95.22%). In validation using clinical positive samples, PanVariants successfully detected all expected pathogenic variants while PanGenie failed. In the cohort joint-calling analysis, PanVariants detected more variants, made fewer Mendelian inheritance errors, and gave better per-sample accuracy than GATK. Conclusions: PanVariants establishes a robust framework and best-practice pipeline for pangenome-based variant detection, achieving both sensitive novel variant discovery and high accuracy for SNVs, indels and SVs. Our systematic evaluation of optional processing steps and input variables offers practical guidance for users.
Validated across diagnostic and population-based applications, our findings strongly support the transition from linear to pangenome references in future genomics.
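The two headline metrics in the PanVariants abstract, F1-score and the relative reduction in variant errors (FPs + FNs), follow standard definitions that can be written down directly. The counts used below are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def error_reduction(baseline_errors, new_errors):
    """Relative reduction in total errors (FP + FN) against a baseline caller."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 1 - new_errors / baseline_errors


# Hypothetical benchmark counts, for illustration only.
example_f1 = f1_score(tp=90, fp=10, fn=10)
example_reduction = error_reduction(baseline_errors=1000, new_errors=270)
```

Because the error count pools false positives and false negatives, a caller can lower it by improving either precision or recall, which is why the abstract reports it alongside F1.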
bioinformatics2026-05-07v3Pan-cancer virtual spatial transcriptomics from routine histology with Phoenix
Tran, M.; Gindra, R. H.; Putze, P.; Senbai, K.; Palla, G.; Kos, T.; Falcomata, C.; Wang, C.; Guo, R.; Boxberg, M.; Berclaz, L. M.; Lindner, L. H.; Bergmayr, L.; Knoesel, T.; Jurmeister, P.; Klauschen, F.; Homicsko, K.; Gottardo, R.; Eckstein, M.; Matek, C.; Mock, A.; Theis, F. J.; Saur, D.; Peng, T.Abstract
Spatial transcriptomics links gene expression to tissue architecture, providing a mechanistic view of cellular organization. Yet existing datasets cover few donors and miss the complexity of human disease. Experimental costs remain prohibitive, and large-scale profiling is impractically slow for population-level studies. Accurate computational methods are urgently needed. Predicting gene expression from standard histology, however, remains an open problem, as current approaches transfer poorly to unseen cohorts and diseases. Here, we present Phoenix, a latent flow matching generative model that infers pan-cancer spatially resolved single-cell gene expression with high accuracy. Phoenix analyzes treatment response in silico: Applied to 763 head and neck cancer patients, it identified three new spatial biomarkers that we validated across two cancers (breast cancer, n = 84; ovarian cancer, n = 157) and treatment regimens (platinum, trastuzumab). Phoenix generalizes beyond carcinomas: In a large sarcoma cohort (802 tissue microarray cores), it accurately predicted cell-type-specific signatures in held-out samples and captured chemotherapy-induced immune remodeling. Phoenix also extends across species: In a mouse model, it accurately predicted the expression of pancreatic cancer lineage markers and the mutant mKrasG12D allele in silico. Together, Phoenix establishes virtual spatial transcriptomics from routine histology as a scalable framework for studying tissue organization, therapeutic response, and disease mechanisms.
bioinformatics2026-05-07v2Better antibodies engineered with a GLIMPSE of human data
Hepler, N. L.; Hill, A. J.; Jaffe, D. B.; Gibbons, M. C.; Pfeiffer, K. A.; Hilton, D. M.; Freeman, M.; McDonnell, W. J.Abstract
GLIMPSE-1 is a protein language model trained solely on paired human antibody sequences. It captures immunological features and achieves best-in-class performance in humanization benchmarks. We demonstrate the utility of GLIMPSE-1 in humanization; engineering of antibodies for affinity, species cross-reactivity, and key developability parameters; and the creation of highly divergent functional variants with <90% sequence identity to a marketed antibody. Learning exclusively from human antibody data enables GLIMPSE-1 to enhance therapeutics and native antibodies based on patterns in the human repertoire.
bioinformatics2026-05-07v2immuneKG: An Immune-Cell-Aware Knowledge Graph Framework for Target Discovery in Immune-Mediated Diseases
Ye, Y.; PB-IDD Department, Pharmablock Sciences Inc.,Abstract
Biomedical knowledge graphs have emerged as foundational infrastructure for AI-driven drug discovery, yet their translational impact on novel target identification in immune-mediated diseases remains limited. Here we present immuneKG, a multimodal knowledge graph centred on autoimmune diseases, constructed through biologically meaningful feature reprogramming of disease nodes to enable deep mechanistic modelling of immune-related disorders. immuneKG introduces a new entity class, immune_cell, and four original directed relation types, together adding 9,105 novel triples absent from all existing biomedical KG schemas. Disease nodes are endowed with three novel modal feature sets quantifying immune homeostatic imbalance: autoantibody profiles, cytokine signatures, and HLA genotypes, complemented by systemic involvement scores and genetic features. The graph encompasses over 407,000 training triples across 7,287 entities and 32 relation types. Applied to inflammatory bowel disease (IBD), immuneKG combined with a HeteroPNA-Attn graph neural network achieves a Hits@100 of 0.99 against a Clarivate Phase II+ clinical pipeline, while a novelty-penalised scoring function surfaces high-potential dark targets. The framework shifts from conventional candidate-space screening to a development-oriented decision-support paradigm, providing actionable and interpretable guidance for downstream drug discovery. The immuneKG project is publicly available on GitHub at https://github.com/YaowenYe/immuneKG.
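Hits@k, the metric quoted in the immuneKG abstract, has more than one definition in the knowledge-graph literature. The sketch below uses one common reading for target prioritisation: the fraction of known targets that appear among the top k ranked candidates. Whether immuneKG computes it exactly this way is an assumption, and the gene identifiers are placeholders.

```python
def hits_at_k(ranked_candidates, true_targets, k):
    """Fraction of true targets found among the first k ranked candidates."""
    true_targets = set(true_targets)
    if not true_targets:
        raise ValueError("true_targets must not be empty")
    top_k = set(ranked_candidates[:k])
    return len(top_k & true_targets) / len(true_targets)


# Placeholder identifiers: a model ranking and a reference set of pipeline targets.
ranking = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
pipeline_targets = ["GENE_B", "GENE_E"]
hits_at_2 = hits_at_k(ranking, pipeline_targets, k=2)
hits_at_5 = hits_at_k(ranking, pipeline_targets, k=5)
```

Under this definition the score can only stay the same or rise as k grows, so a value such as Hits@100 should be read together with the size of the candidate pool.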
bioinformatics2026-05-07v2STAT: A multi-agent framework for integrated and interactive spatial transcriptomics analysis
Chen, Y.; Han, S.; Chao, Z.; Liu, Y.; Zhang, F.; Chen, H.; Wang, J.; Xiao, J.; Yang, C.Abstract
Spatial transcriptomics analysis often involves a myriad of computational methods across diverse platforms, leading analysts to spend excessive time on data assembly rather than deriving biological insights. Current AI solutions tend to either oversimplify spatial data into generic single-cell tables or operate autonomously without opportunities for intermediate review, thus hindering the visual and iterative analyses essential for spatial biology. In response to these challenges, we introduce STAT, a multi-agent framework, designed to make spatial analysis more conversational and user-friendly while maintaining transparency and control. STAT integrates a persistent session, a shared interactive tissue viewer, and a staged skill-aware pipeline, enabling a more intuitive analytical experience. In a comprehensive benchmark evaluation encompassing eleven analytical task categories across three spatial platforms and both cell- and spot-resolution data, STAT demonstrated superior performance compared to a baseline large language model and existing autonomous spatial analysis agents, excelling in task completion, analytical quality, and token efficiency. Notably, STAT enables multi-task spatial analysis of a mixed-resolution breast cancer cohort, successfully reproducing key findings from a published Visium HD colorectal cancer study based solely on natural language prompts. STAT thus facilitates trustworthy and scientifically rigorous spatial transcriptomics analysis, allowing researchers to focus more on biological interpretation.
bioinformatics2026-05-07v2Scalable subclonal reconstruction of cancer cells in DNA sequencing data using a penalized likelihood model
Jiang, Y.; Montierth, M. D.; Ding, Y.; Yu, K.; Tran, Q.; Wu, A.; Li, R.; Ji, S.; Liu, X.; Shin, S. J.; Cao, S.; Tang, Y.; Lesluyes, T.; Kimmel, M.; Wang, J. R.; Tarabichi, M.; Zhu, H.; Van Loo, P.; Wang, W.Abstract
Tumor subclonal architecture shapes cancer evolution, yet subclonal reconstruction from bulk sequencing remains difficult to scale due to computational cost and model complexity. We present CliPP, a penalized-likelihood framework that jointly estimates cellular prevalence with pairwise fusion penalties, automatically identifying subclones without requiring extensive priors. Across simulations and 2,778 whole-genome tumors with external consensus reconstructions, CliPP achieves consistently good performance compared with state-of-the-art approaches while providing substantial runtime reductions. Applied to 7,000+ tumors across >30 cancer types, CliPP quantifies pervasive subclonality and delineates cohort-level subclone landscapes. CliPP enables fast, reproducible large-scale subclonal analysis and is freely available to the community through GitHub and a Shiny app.
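The general idea behind a pairwise fusion penalty can be sketched as follows: each mutation gets its own prevalence estimate, and a penalty on all pairwise differences pulls similar estimates to a shared value, so that clusters (subclones) emerge without fixing their number in advance. The abstract does not give CliPP's likelihood or penalty function, so the squared-error fit and the absolute-difference penalty below are stand-ins chosen only for illustration.

```python
from itertools import combinations


def fusion_objective(estimates, observed, lam):
    """Squared-error fit plus lam times the sum of pairwise absolute differences."""
    fit = sum((e - o) ** 2 for e, o in zip(estimates, observed))
    penalty = sum(abs(a - b) for a, b in combinations(estimates, 2))
    return fit + lam * penalty


def fused_groups(estimates, tol=1e-6):
    """Group mutation indices whose estimates coincide within tol."""
    groups = []
    for index, value in enumerate(estimates):
        for group in groups:
            if abs(estimates[group[0]] - value) <= tol:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


# Hypothetical per-mutation prevalence observations.
observed = [0.98, 1.02, 0.41, 0.39]
unfused = list(observed)
fused = [1.0, 1.0, 0.4, 0.4]
unfused_cost = fusion_objective(unfused, observed, lam=0.05)
fused_cost = fusion_objective(fused, observed, lam=0.05)
subclones = fused_groups(fused)
```

In this toy case the fused solution has the lower objective, and the groups of identical estimates are read off as two subclones.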
bioinformatics2026-05-07v2Gene-Modulated Network Diffusion for Improved Modeling of Amyloid-β Spread in Alzheimer's Disease
Xu, F. H.; Duong-Tran, D.; Huang, H.; Saykin, A. J.; Thompson, P. M.; Davatzikos, C.; Zhao, Y.; Shen, L.Abstract
Understanding the pathogenesis of amyloid-β pathology in Alzheimer's Disease (AD) proves to be a challenge. In this work, we expand upon the application of network diffusion models (NDM) to study pathophysiological spread of amyloid-β throughout white matter structural brain networks. We found that the NDM successfully recaptures subpopulation-level spatial patterns (Pearson's R=0.45-0.48, PFDR < 0.01) of amyloid-β deposition in the Alzheimer's Disease Neuroimaging Cohort at a regional level, but with drawbacks in mechanism interpretability. We then moved to an extended NDM framework (eNDM), including a protein synthesis term to better reflect the role of amyloid-β metabolism, as well as including regional vulnerability using spatial transcriptomics from the Allen Human Brain Atlas to modulate the region-level rate parameters of the synthesis term. The novel gene eNDMs exhibited significant performance increases in Pearson's correlation (Steiger's Z, PFDR < 0.10) over baseline NDM performance in mild cognitive impairment and AD groups using APOE, SORL1, and FGL2 for gene modulation. The results were robust and replicable when testing on an external cohort of the Alzheimer's Disease Sequencing Project. The study thus demonstrates the importance of regional genetic vulnerability, in conjunction with network diffusion mechanisms, in improving the modelling and prediction of amyloid-β pathophysiological spread.
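A network diffusion model is commonly written as dx/dt = -beta * L x, where L is the graph Laplacian of the structural connectome and x holds the regional pathology load. The sketch below integrates that equation with explicit Euler steps and accepts an optional per-region source term as a stand-in for a synthesis term. The exact form of the eNDM, including how gene expression modulates the synthesis rate, is not given in the abstract, so the source term and all parameter values here are assumptions.

```python
def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric weighted adjacency matrix."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def simulate(adjacency, x0, beta, steps, dt, source=None):
    """Explicit Euler integration of dx/dt = -beta * L x + source."""
    lap = laplacian(adjacency)
    n = len(x0)
    source = source if source is not None else [0.0] * n
    x = list(x0)
    for _ in range(steps):
        dx = [
            -beta * sum(lap[i][j] * x[j] for j in range(n)) + source[i]
            for i in range(n)
        ]
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


# Toy three-region chain (hypothetical connectome), seeded in region 0.
chain = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
diffused = simulate(chain, [1.0, 0.0, 0.0], beta=1.0, steps=2000, dt=0.01)
with_source = simulate(chain, [1.0, 0.0, 0.0], beta=1.0, steps=2000, dt=0.01,
                       source=[0.0, 0.0, 0.01])
```

Pure diffusion conserves the total load and spreads it evenly along the chain, whereas any positive source term makes the total grow over time; this is the qualitative difference a synthesis term introduces.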
bioinformatics2026-05-07v1geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration
Feng, Z.; Li, T.Abstract
Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets (arising from annotation version updates, historical renaming, and synonym reassignment) introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020-2025 and five CellRanger versions shows that 1.41%-6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707-1,098 genes per dataset pair. Notably, CellRanger annotation version, rather than data collection year, was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
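The three-tier matching order described in the geneSync abstract (authoritative symbol, then NCBI symbol, then synonym fallback) can be sketched as a simple cascade. geneSync is an R package and its real interface is not shown in the abstract; the Python function, the lookup tables, and the placeholder gene names below are all hypothetical.

```python
def harmonize(symbol, authoritative, ncbi_to_authoritative, synonym_to_authoritative):
    """Return (resolved_symbol, tier) using the first tier that matches."""
    if symbol in authoritative:
        return symbol, "authoritative"
    if symbol in ncbi_to_authoritative:
        return ncbi_to_authoritative[symbol], "ncbi"
    if symbol in synonym_to_authoritative:
        return synonym_to_authoritative[symbol], "synonym"
    return None, "unmatched"


# Placeholder lookup tables standing in for an offline annotation database.
authoritative = {"GeneA", "GeneB", "GeneC"}
ncbi_to_authoritative = {"GeneB_ncbi": "GeneB"}
synonym_to_authoritative = {"GeneC_old": "GeneC"}

features = ["GeneA", "GeneB_ncbi", "GeneC_old", "GeneX"]
# Keeping the tier alongside each result gives an audit trail of how it was resolved.
report = {
    feature: harmonize(feature, authoritative, ncbi_to_authoritative, synonym_to_authoritative)
    for feature in features
}
```

Ordering the tiers from most to least authoritative means an ambiguous synonym can never override a symbol that already matches exactly.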
bioinformatics2026-05-07v1metaJAM: a Nextflow integrated metagenomic workflow for sedimentary ancient DNA
Johnson, E.; Jin, C.; Guinet, B.; Alumbaugh, J.; Martin, N. L.Abstract
The application of metagenomics in ancient DNA (aDNA) research is rapidly expanding, driven in particular by advances in sedimentary aDNA research and sequencing technologies. Although many ancient DNA studies rely on broadly similar bioinformatic strategies, there is still no single standardized, widely adopted workflow. These differences can directly affect how efficiently past biodiversity can be reconstructed and authenticated from the various archives analyzed using ancient metagenomic approaches. Although a few pipelines tackle the processing of ancient DNA data from shotgun sequencing, the ones applied to metagenomic datasets are scarce and often resource-intensive or challenging to install, update, or extend with new tools and parameters. metaJAM, a scalable and user-friendly pipeline, is presented here to specifically address the challenges of metagenomic aDNA analyses of eukaryotes. The pipeline has been designed in Nextflow to ensure continuous development and can be used on different high-performance computing (HPC) clusters. metaJAM integrates all key steps required for ancient DNA metagenomic analyses, from raw sequencing data pre-processing to microbial filtering, taxonomic assignment via competitive iterative mapping against Bowtie 2 reference indexes and reassignment using lowest common ancestor (LCA) inference. Validation and authentication are performed using the post-LCA toolkit bamdam together with alignment to an exhaustive reference database using MMseqs2. It allows users to choose among alternative tools and generates a series of plots to support data visualization and taxon authentication. 
metaJAM differs from existing pipelines through its implementation of rigorous filtering of microbial-like reads by Kraken 2 classification and masking of microbial-like regions, iterative or parallel Bowtie 2 mapping, validation of the detected taxa and integration of up-to-date tools for ancient metagenomic analysis, along with diagnostic plots that help users assess the reliability of taxonomic assignments and visualize their data. It works well with limited computational resources and with customised databases for taxonomic groups, and provides an accessible workflow to support the investigation of metagenomic ancient DNA datasets. Its applications span a range of contexts, from ecosystem reconstructions in environmental aDNA archives such as sediments, to metagenomic studies on archaeological artefacts and even taxonomic identification of undiagnosed biological materials.
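Lowest common ancestor (LCA) reassignment, mentioned in the metaJAM abstract, places a read that maps to several references at the deepest taxonomic node shared by all of them. The sketch below shows the core operation on a child-to-parent map. The toy taxonomy is simplified and illustrative, and nothing here reflects how metaJAM or its underlying tools implement the step.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, following child-to-parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa, or None."""
    taxa = list(taxa)
    if not taxa:
        return None
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0] if common else None


# Simplified, illustrative taxonomy (intermediate ranks omitted).
parent = {
    "Bos taurus": "Bos",
    "Bison bison": "Bison",
    "Ovis aries": "Ovis",
    "Bos": "Bovidae",
    "Bison": "Bovidae",
    "Ovis": "Bovidae",
    "Bovidae": "Mammalia",
}
ambiguous_read = lowest_common_ancestor(["Bos taurus", "Bison bison"], parent)
unique_read = lowest_common_ancestor(["Ovis aries"], parent)
```

A read hitting only one species keeps its species-level label, while an ambiguous read is pushed up to the family, which is the conservative behaviour that makes LCA useful for damaged, short ancient reads.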
bioinformatics2026-05-07v1Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges
Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.Abstract
Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. 
The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
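The Cramer-Wold theorem says that two multivariate distributions are equal exactly when all of their one-dimensional projections are equal, which motivates testing similarity by projecting both samples onto many random directions and applying a univariate two-sample test to each. The sketch below does this with an asymptotic Kolmogorov-Smirnov decision and reports the fraction of directions that reject. The choice of the KS test, the number of directions, and the simulated data are assumptions; the abstract's replicate aggregation and standardized normal statistic are not reproduced here.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def ks_rejects(a, b, alpha=0.05):
    """Asymptotic decision: reject when D > c(alpha) * sqrt((n + m) / (n * m))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))


def random_direction(dim, rng):
    """Uniformly random unit vector obtained by normalising Gaussian draws."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(data, direction):
    return [sum(x * w for x, w in zip(row, direction)) for row in data]


def rejection_rate(x, y, n_directions=200, alpha=0.05, seed=0):
    """Fraction of random projections on which the two samples differ."""
    rng = random.Random(seed)
    dim = len(x[0])
    rejections = 0
    for _ in range(n_directions):
        w = random_direction(dim, rng)
        rejections += int(ks_rejects(project(x, w), project(y, w), alpha))
    return rejections / n_directions


# Simulated 2-D samples: a reference, a same-distribution copy, and a shifted copy.
data_rng = random.Random(1)
reference = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
similar = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
shifted = [[data_rng.gauss(1, 1), data_rng.gauss(1, 1)] for _ in range(200)]
rate_similar = rejection_rate(reference, similar)
rate_shifted = rejection_rate(reference, shifted)
```

Because every projection reuses the same two samples, the per-direction outcomes are not independent, which is one reason a formal framework needs replicates and a separate inference level rather than a naive binomial test on this rate.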
bioinformatics2026-05-07v1Striping artifact removal in VisiumHD data through nuclear counts modeling
Malsot, P.; Londschien, M.; Boeva, V.; Raetsch, G.Abstract
Motivation: 10x Genomics VisiumHD enables spatial transcriptomics at 2 µm x 2 µm resolution but exhibits slide-specific, non-periodic striping artifacts due to lane-width variability. These multiplicative row/column effects distort bin total counts and can bias downstream analyses. The state-of-the-art destriping approach is the normalization procedure used as a preprocessing step in bin2cell; it applies sequential high-quantile row- then column-wise normalization, which is asymmetric and can introduce edge effects/macro-stripes and distortions of large-scale total-count structure. Results: We propose a statistical destriping approach that leverages nuclei segmentation from the co-registered H&E image. Assuming transcript abundance is constant within each nucleus, we model bin counts with a negative binomial distribution whose mean is a product of a nucleus-specific concentration and row- and column-specific stripe-factors reflecting lane-width variation. We fit all parameters in a generalized linear modeling framework with cross-validated regularization on stripe-factors and iterative dispersion estimation, and use the fitted parameters to correct the observed counts into a destriped image. On synthetic data with known ground truth, our method improves stripe-factor estimation accuracy and reduces error in corrected counts relative to bin2cell and bin2cell-derived baselines. Across four public VisiumHD slides, it consistently lowers striping intensity while substantially better preserving biological signal present in the large-scale global count structure and avoiding the artifacts introduced by other methods. Availability and Implementation: All source code and links to publicly available data used for this study are available at https://github.com/paolamalsot/destriping-GLM.
Contact: paola.malsot@inf.ethz.ch, raetsch@inf.ethz.ch Note: This manuscript extends the version submitted to Intelligent Systems for Molecular Biology (ISMB) 2026 by describing a new optimization algorithm that yields an approximately tenfold speedup. All plots and benchmarks in this manuscript use the updated implementation.
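The multiplicative structure of the model in the destriping abstract can be illustrated without the fitting machinery: the expected count in a bin is the local concentration times a row factor times a column factor, and correction divides those factors back out. The sketch below assumes the stripe factors are already known. The negative binomial likelihood, the regularised GLM fit, and the dispersion estimation are not reproduced, and all numbers are made up.

```python
def striped_mean(concentration, row_factors, col_factors):
    """Expected bin counts: concentration[i][j] * row_factors[i] * col_factors[j]."""
    return [
        [concentration[i][j] * row_factors[i] * col_factors[j]
         for j in range(len(col_factors))]
        for i in range(len(row_factors))
    ]


def destripe(counts, row_factors, col_factors):
    """Undo multiplicative row/column stripe effects given the stripe factors."""
    return [
        [counts[i][j] / (row_factors[i] * col_factors[j])
         for j in range(len(col_factors))]
        for i in range(len(row_factors))
    ]


# Made-up example: a 2 x 3 patch with constant concentration inside one nucleus.
concentration = [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]
row_factors = [1.2, 0.8]
col_factors = [1.0, 0.9, 1.1]
observed_mean = striped_mean(concentration, row_factors, col_factors)
corrected = destripe(observed_mean, row_factors, col_factors)
```

The constant-within-nucleus assumption is what makes the factors identifiable in practice: any variation among bins of the same nucleus can then be attributed to rows and columns rather than to biology.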
bioinformatics2026-05-07v1SLiMNet: a deep learning model to detect short linear motifs using protein large language model representations and paired inputs
McFee, M. C.; Kim, P. M.Abstract
Short linear motifs (SLiMs) are short (3-15 amino acids in length) segments within intrinsically disordered regions (IDRs) that mediate transient protein-protein interactions as well as other functions such as stability and subcellular localization. Only a few thousand out of likely hundreds of thousands have been experimentally validated. SLiMs can be detected as conserved regions inside of IDRs using local alignments, though current approaches have limited sensitivity and specificity and are unable to functionally annotate their hits. Assigning function is hence a major outstanding issue in SLiM biology. Here we present SLiMNet, a deep learning model inspired by siamese networks and contrastive learning that predicts functional similarity in pairs of SLiMs. SLiMNet uses protein large language model embeddings and is trained on annotated sets of SLiMs. We show that it detects shared function in unseen, non-redundant motif pairs, and its scores correlate with experimental binding strengths from deep mutational scanning of cyclin-binding motifs. Using SLiMNet we provide repositories of putative SLiM pairs derived from annotated IDRs to help with hypothesis generation for the functional annotation of SLiMs. This includes an atlas generated from all-by-all scoring of 16-mers from tiled IDRs from the DisProt database. We show that it captures a new nuclear localization motif recently added to MoMaP and a PRMT1 methylation motif in the literature. We also provide a repository of all IDRs scored with SLiMNet against all MoMaP instances, and an atlas of potential functional pairs for 256 known orphan motifs (motifs with only a single known instance with essential function). Collectively, these atlases are useful resources for the SLiM biology community.
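The tiling step behind the all-by-all atlas can be sketched as cutting each disordered region into overlapping 16-residue windows and enumerating every unordered pair of windows as a candidate for scoring. The stride, the function names, and the example sequence are assumptions; the scoring model itself is not shown because the abstract gives no interface for it.

```python
from itertools import combinations


def tile(sequence, k=16, stride=1):
    """Overlapping k-mers from a sequence; the stride is an assumed parameter."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def all_by_all_pairs(tiles):
    """Every unordered pair of tiles, i.e. the inputs to a pairwise scorer."""
    return list(combinations(tiles, 2))


# Made-up 20-residue disordered region.
idr = "MSGSPKKRRLSPTEDGSAAE"
tiles = tile(idr)
pairs = all_by_all_pairs(tiles)
```

The number of pairs grows quadratically with the number of tiles, which is why a larger stride or a pre-filter is often needed before scoring a whole database.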
bioinformatics2026-05-07v1scLASER: a robust framework for simulating and detecting time-dependent single-cell dynamics in longitudinal studies
Vanderlinden, L. A.; Vargas, J.; Inamo, J.; Young, J.; Wang, C.; Zhang, F.Abstract
Longitudinal single-cell clinical studies enable tracking within-individual cellular dynamics, but methods for modeling temporal phenotypic changes and estimating power remain limited. We present scLASER, a framework detecting time-dependent cellular neighborhood dynamics and simulating longitudinal single-cell datasets for power estimation. Across benchmark experiments, scLASER shows consistently higher sensitivity than traditional cluster-based approaches, with particularly pronounced gains in rare cell types and non-linear temporal patterns. Applications to inflammatory bowel disease (95,813 cells, 38 patients) reveal treatment-responsive NOTCH3+ stromal trajectories with high cell type discrimination (AUC > 0.92), while analysis of COVID-19 data (188,181 cells, 84 patients) identifies three distinct axes of T cell activity (cytotoxic effector, NK immunoreceptor signaling, and interferon-stimulated gene programs) over disease progression. scLASER enables robust longitudinal single-cell analysis and optimization of study design.
bioinformatics2026-05-07v1A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome
Karthik, A. S. P.; Das, A. B.Abstract
We developed a lightweight codon-based DNA Transformer equipped with multi-head self-attention and an adaptive classifier head, which achieves high accuracy in exon-intron classification and moderate accuracy in CDS classification and splice site recognition. We named this model ExIT (Exon-Intron Transformer). The model uses codon tokenization. It was validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking indicates that our model outperforms existing models on these tasks.
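Codon tokenization can be illustrated as splitting a DNA sequence into non-overlapping triplets and mapping each triplet to an integer id from a 64-codon vocabulary. The abstract does not describe ExIT's tokenizer in detail, so the vocabulary ordering, the unknown-token handling, and the reading-frame parameter below are assumptions.

```python
from itertools import product

# 64 codons in lexicographic order over A, C, G, T; id 0 is reserved for unknowns.
CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]
VOCAB = {"<unk>": 0}
VOCAB.update({codon: index + 1 for index, codon in enumerate(CODONS)})


def codon_tokenize(sequence, frame=0):
    """Split into non-overlapping triplets from `frame` and map them to ids."""
    sequence = sequence.upper()
    # A trailing fragment shorter than three bases is dropped.
    codons = [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
    return [VOCAB.get(codon, VOCAB["<unk>"]) for codon in codons]


token_ids = codon_tokenize("ATGAAATAG")
with_ambiguity = codon_tokenize("atgNNNta")
```

Compared with single-nucleotide tokens, this shortens the input by a factor of three, which is one plausible reason a codon-level model can stay lightweight.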
bioinformatics2026-05-07v1Bridging genomes and peptidomes: hybrid sequencing reveals conserved bioactive peptides in crustaceans
Fields, L.; Qin, J.; Ibarra, A. E.; Selby, K. G.; Gao, T.; Dang, T. C.; Lu, H.; Li, L.Abstract
Endogenous peptides are critical regulators of signaling and immunity but remain difficult to characterize in organisms with incomplete genomic annotation. We developed a hybrid discovery platform that integrates transformer-based de novo sequencing (Casanovo), neuropeptide-focused database searching (EndoGenius), and empirical false discovery rate estimation via NovoBoard. This pipeline enables confident identification of endogenous peptides while expanding coverage beyond conventional database-only or de novo-only approaches. Applied to neuroendocrine tissues from Callinectes sapidus and Cancer borealis, the workflow revealed numerous high-abundance novel peptides and provided structural and genomic support for their biological relevance. Notably, we report the first histone-2A-derived antimicrobial peptide in C. sapidus and characterize naturally occurring sequence variants. We also identified unexpected peptide homologies between crustaceans and Rattus norvegicus, enabling annotation of conserved housekeeping proteins in sparsely annotated genomes. This hybrid platform establishes a scalable, open-source strategy for advancing neuropeptidomics and endogenous peptide discovery in emerging model organisms.
bioinformatics2026-05-07v1ORBIT: Orthogonal Rotation for Biological Inter-species Transfer
Wissenberg, P.; Lee, J. M.; Mutwil, M.Abstract
Motivation. Cross-species gene embeddings are central to transferring functional annotations between species. A recent method demonstrated that species-specific STRING (PPI) network embeddings can be aligned across 1322 eukaryotes with autoencoders (FedCoder), but this approach is computationally expensive, depends on careful hyperparameter selection, leaves substantial room for improvement in cross-species retrieval quality, and has not been demonstrated on coexpression networks. Results. We introduce an alignment pipeline for cross-species coexpression network embeddings based on orthogonal Procrustes rotation. Species-specific Node2Vec embeddings of coexpression networks are aligned to a shared space using ortholog anchors from OrthoFinder, solved in closed form via Singular Value Decomposition (SVD). Applied to 153 plant species and 5.7 million genes, Procrustes alignment achieves four-fold higher cross-species Spearman correlation and consistently higher retrieval metrics than the SPACE autoencoder, while leaving within-species coexpression structure invariant (preservation ratio 1.000 against the unaligned baseline). The full alignment completes in under three minutes on a single CPU, and on downstream tasks, Procrustes embeddings improve within-species GO term prediction and outperform SPACE for cross-species GO transfer. Procrustes and sequence embeddings remain complementary for biological-process prediction, consistent with observations from SPACE. Availability. Code for producing the embeddings is made available at https://github.com/pwissenberg/orbit
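The closed-form alignment step mentioned in the abstract can be sketched with the standard orthogonal Procrustes solution. This is a generic NumPy illustration, not the ORBIT code: the function name `procrustes_rotation`, the array shapes, and the synthetic data are assumptions, and the anchor rows stand in for ortholog pairs.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R minimising ||source @ R - target||_F.

    source, target: (n_anchors, dim) arrays whose i-th rows form one anchor pair.
    """
    # SVD of the cross-covariance; R = U V^T is the classical closed-form solution.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic check: the "target species" is an exactly rotated copy of the source.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))                      # hypothetical gene embeddings
true_q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
rot = procrustes_rotation(emb[:20], emb[:20] @ true_q)  # fit on 20 anchor rows only

aligned = emb @ rot
print(np.allclose(rot, true_q))                     # rotation recovered from anchors
print(np.allclose(aligned @ aligned.T, emb @ emb.T))  # inner products unchanged
```

Because R is orthogonal, all within-species inner products and distances are unchanged by the alignment, which is consistent with the reported preservation ratio of 1.000.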
bioinformatics2026-05-07v1Image-Conditioned Diffusion for Privacy-Preserving Synthetic Medical Images
Yaya-Stupp, D.; Lutsker, G.; Spiegel-Yerushalmi, O.; Segal, E.Abstract
Medical imaging models depend on large, shareable datasets, yet privacy constraints limit data dissemination. Current text-conditioned diffusion models fail to preserve subtle, distributed clinical signals, such as continuous physiological biomarkers, rendering synthetic data insufficient for robust downstream physiological modeling. Here, we evaluate image-to-image (I2I) diffusion as a tunable, privacy-preserving transformation that produces a synthetic counterpart of real images while preserving downstream-relevant information. We fine-tune Stable Diffusion with low-rank adapters on retinal fundus photographs and chest radiographs, assessing fidelity, clinical signal preservation, cross-site transfer, and empirical re-identification risk. I2I consistently outperforms text-to-image generation in image fidelity and in preserving biomarker information. In cross-cohort transfer to an external retinal dataset from the UK Biobank, pretraining on I2I synthetic data performs comparably to real-image pretraining and surpasses it in the smallest fine-tuning sets. Varying I2I strength reveals that the privacy-utility tradeoff is highly modality-dependent: while retinal images achieve practical de-identification, chest X-rays exhibit structural combinatorics that leave them substantially re-identifiable even at high noise strengths, exposing critical boundaries for diffusion-based anonymization. These results position image-conditioned diffusion as a practical approach for generating shareable medical images with tunable de-identification.
bioinformatics2026-05-07v1BGC-QUAST: a quality assessment tool for genome mining software
Kushnareva, A.; Tupikina, D.; Almessady, H.; McHardy, A.; Gurevich, A.Abstract
Summary: Biosynthetic gene clusters (BGCs) encode microbial natural products, many of which have important ecological and biomedical roles. Genome mining tools enable large-scale BGC prediction, but their outputs differ substantially, complicating comparison and interpretation. We present BGC-QUAST, a framework for evaluating and comparing BGC predictions across three analysis modes: comparison across samples, assessment of BGC recovery in draft assemblies relative to reference genomes, and comparison of predictions from different tools using overlap analysis. BGC-QUAST provides standardized metrics, interactive visualizations, and integrated outputs for joint inspection of predictions, enabling the comprehensive comparison of genome mining results and facilitating sample prioritisation based on biosynthetic potential. Availability and implementation: BGC-QUAST is publicly available at https://github.com/gurevichlab/bgc-quast
bioinformatics2026-05-07v1A vaccine for global eradication of TB - A novel conceptual framework and design of a potent peptide-based vaccine with universal coverage through advanced computational vaccinology
Pawar, P.; samarasinghe, s.Abstract
Tuberculosis (TB) remains a formidable global health challenge, exacerbated by the emergence of drug-resistant Mycobacterium tuberculosis strains that threaten to render existing drug therapies and vaccines ineffective. Despite the availability of the Bacillus Calmette-Guerin (BCG) vaccine, its limited efficacy, primarily in infants and young children, falls short of reducing TB prevalence or offering adequate protection to adults. Therefore, developing a new TB vaccine with enhanced efficacy and the capability to generate a robust reservoir of memory cells is essential. Addressing the challenge of drug-resistant tuberculosis requires a deep understanding of bacterial evolution and developing robust countermeasures. This study aims to design a next-generation TB vaccine that provides broad-spectrum protection against various Mycobacterium tuberculosis strains, including drug-resistant ones. By conducting an in-depth investigation into pathogen-human interactions, the research proposes a holistic framework that leverages computational vaccinology to tackle challenges posed by pathogen polymorphism and overcome the limitations of conventional vaccines. By targeting conserved proteins across diverse TB strains and enhancing both humoral and cell-mediated immunity, this study proposes a new strategy for an epitope-based vaccine that provides long-lasting, universal coverage. An extensive proteomic, reverse vaccinology and immunoinformatics analysis of 159 TB strains yielded 27 highly conserved, immunogenic, non-toxic, and non-allergenic epitopes. These epitopes, consisting of 14 cytotoxic T-lymphocytes (CTL), 5 helper T-lymphocytes (HTL), and 8 B-cell epitopes, were used to construct a three-dimensional, multi-epitope TB vaccine designed based on a new concept introduced in this research for maximising vaccine efficacy.
Molecular docking and immune simulation studies demonstrated a significant affinity between the vaccine constructs and toll-like receptors, indicating a strong potential for effective immune system engagement. The crucial features of the epitope-based TB vaccine constructed in this research include sequence conservancy, robust antigenicity, exclusion of self-peptides and potential for diverse allelic interactions. The proposed epitope-based vaccine is poised to be highly effective, safe, and capable of providing universal coverage, potentially paving the way for global TB eradication. Validation in laboratory and clinical settings will be essential to confirm its efficacy and real-world applicability.
bioinformatics2026-05-07v1Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling
Calvanese, F.; Lombardi, G.; Weigt, M.; FERNANDEZ-DE-COSSIO-DIAZ, J.Abstract
Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
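The general idea of interleaving elongation with Metropolis-Hastings refinement can be sketched on a toy problem. This is not the authors' ILMC algorithm: the uniform stand-in "model", the hydrophobic-count steering score, the tilted target p_model(x) * exp(beta * s(x)), and the `lookback` and `steps` parameters are all assumptions chosen to keep the example small and self-contained.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
HYDROPHOBIC = set("AILMFVW")        # illustrative steering set (assumption)


def model_logprob(seq):
    """Toy stand-in for a pLM: every residue is uniform, so log p = -L * log(20)."""
    return -len(seq) * math.log(len(ALPHABET))


def steering_score(seq):
    """Hypothetical steering potential: number of hydrophobic residues."""
    return sum(ch in HYDROPHOBIC for ch in seq)


def log_target(seq, beta):
    """Unnormalised log of the tilted target p_model(x) * exp(beta * s(x))."""
    return model_logprob(seq) + beta * steering_score(seq)


def mh_refine(seq, beta, lookback, steps, rng):
    """Metropolis-Hastings with symmetric single-site mutations in the last `lookback` positions."""
    seq = list(seq)
    current = log_target(seq, beta)
    for _ in range(steps):
        pos = rng.randrange(max(0, len(seq) - lookback), len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        proposed = log_target(seq, beta)
        # Accept with probability min(1, target(new) / target(old)).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(seq)


def sample(length, beta, lookback=5, steps=10, seed=0):
    """Alternate one elongation step with a short MH refinement of the recent window."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        seq += rng.choice(ALPHABET)  # elongation: next residue from the toy model
        seq = mh_refine(seq, beta, lookback, steps, rng)
    return seq


print(sample(30, beta=0.0))  # beta = 0 reduces to sampling from the toy model
print(sample(30, beta=2.0))  # beta > 0 tilts samples toward the steering score
```

With a real pLM, `model_logprob` would be replaced by the model's sequence log-likelihood and elongation would draw from its conditional distribution; the acceptance rule itself would stay the same as long as the proposal is symmetric.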
bioinformatics2026-05-07v1Solid Tumors Pan Cancer Transcriptome Tissue/Cancer specific expression groups at the Isoform-Level
Surana, P.; Obusan, M.; Davuluri, R. V.Abstract
Most of the human genome is transcribed into diverse isoforms whose tissue specificity is profoundly disrupted in cancer, yet isoform-level dysregulation remains poorly characterized across solid tumors. Here, we introduce STPCaT (Solid Tumors Pan-Cancer Transcriptome), an isoform-centric analysis extending TransTEx to systematically classify transcript expression across TCGA solid tumors and GTEx normal tissues. STPCaT reveals a striking collapse of normal tissue-specific programs in cancer, accompanied by the emergence of two dominant expression groups: cancer-high (CanHigh) and normal-high (NorHigh) isoforms. We uncover a large repertoire of previously unannotated Cancer-Testis Antigens (CTAs), the majority of which are absent from existing CTA databases, with broad relevance across multiple cancers, including gliomas. In pan-gliomas, consensus clustering and random-forest feature selection identify compact, highly discriminative isoform signatures that robustly stratify low-grade gliomas and glioblastomas with up to 97 to 98% accuracy using as few as five transcripts. These signatures recapitulate canonical glioma biology and highlight pathways linked to migration, development, and vesicle trafficking. Independent validation in the GLASS consortium cohort demonstrates cohort-specific trends that partially recapitulate primary findings, reflecting known biological heterogeneity across patient populations. Together, STPCaT provides a scalable, isoform-resolved resource for tumor stratification, CTA discovery, and precision oncology applications across solid tumors.
bioinformatics2026-05-07v1ProtSpace: Protein Universe in Your Browser
Senoner, T.; Vahidi, P.; Olenyi, T.; Senoner, F.; Sisman, G.; Kahl, E.; Rost, B.; Koludarov, I.Abstract
Protein Language Models (pLMs) generate per-protein embeddings that encode functional, structural, and evolutionary information, yet the relationships captured in these representations remain difficult to explore systematically. ProtSpace (https://protspace.app) is a web application for interactive visualization of pLM embedding spaces, enabling hypothesis generation directly in the browser without installation. Unlike traditional network-based tools that exclusively visualize amino acid sequence similarity, ProtSpace explores embedding spaces, revealing relationships often not captured by traditional comparisons. Users provide protein sequences or pre-computed embeddings through a Google Colab notebook or the Python CLI; the pipeline applies dimensionality reduction, retrieves 38 annotation types spanning UniProt, InterPro, NCBI Taxonomy, TED structural domains, and sequence-based predictors served via Biocentral, and produces a portable binary file for the browser-based viewer. WebGL-accelerated rendering supports interactive exploration of over 570,000 proteins. Distinctive features include per-point pie charts for multi-label annotations and integrated 3D structure viewing through AlphaFold2 predictions. All computation happens on the user's machine, ensuring data privacy. We demonstrate the utility of ProtSpace through a progressive zoom-in across biological scales: from global proteome organization of Swiss-Prot, through cross-species comparison revealing conserved and lineage-specific families, to functional hypothesis generation within the beta-lactamase superfamily. ProtSpace is freely available at https://protspace.app under the Apache 2.0 license.
bioinformatics2026-05-07v1DupyliCate: mining, classifying, and characterizing gene duplications
Natarajan, S.; Pucker, B.Abstract
Paralogs, copies of a gene, form an important basis for novelty during evolution. Analysis of such gene duplications is important to understand the emergence of novel traits during evolution. DupyliCate is a Python tool that has been developed for this purpose. With the ability to process multiple datasets concurrently, flexible features, and parameters to set species-specific thresholds, DupyliCate offers a high-throughput method for gene copy identification and analysis. The different available parameters and modes are explored in detail based on the Arabidopsis thaliana datasets. Proof of concept for the tool is presented by characterizing well-known duplications in different plants, and its broad applicability is demonstrated by running it on diverse datasets including complex plant genome sequences with high heterozygosity. Further, two case studies are presented as exemplar use cases: the evolution of flavonol synthase (FLS) genes in Brassicales, and the evolution of the flavonol synthesis regulating myeloblastosis (MYB) transcription factors MYB12 and MYB111 across a large number of plant species. The tool's applicability beyond plants is demonstrated on Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans datasets. DupyliCate is available at: https://github.com/ShakNat/DupyliCate.
bioinformatics2026-05-06v4Identification, evolutionary history and characteristics of orphan genes in root-knot nematodes
Seckin, E.; Colinet, D.; Bailly-Bechet, M.; Seassau, A.; Bottini, S.; Sarti, E.; Danchin, E. G.Abstract
Orphan genes, lacking homologs in other species, are systematically found across genomes. Their presence may result from extensive divergence from pre-existing genes or from de novo gene birth, which occurs when a gene emerges from a previously non-genic region. In this study, we identified orphan genes in the genomes of globally distributed plant-parasitic nematodes of the genus Meloidogyne and investigated their origins, evolution, and characteristics. Using a comparative genomics framework across 85 nematode species, we found that 18% of Meloidogyne genes are genus-specific, transcriptionally supported orphans. By combining ancestral sequence reconstruction and synteny-based approaches, we inferred that 20% of these orphan genes originated through high divergence, while 18% likely emerged de novo. Proteomic and translatomic evidence confirmed the translation of a subset of these genes, and feature analyses revealed distinctive molecular signatures, including shorter length, signal peptide enrichment, and a tendency for extracellular localization. These findings highlight orphan genes as a substantial and previously underexplored component of the Meloidogyne genome, with potential roles in their worldwide parasitism.
bioinformatics2026-05-06v3Large-Scale Statistical Dissection of Sequence-Derived Biochemical Features Distinguishing Soluble and Insoluble Proteins
Vu, N. H. H.; Nguyen Bao, L.Abstract
Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a statistically rigorous large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff's δ, and discriminative performance was assessed by ROC-AUC. Although 34 features remained significant after correction, most exhibited small effect sizes and substantial class overlap, consistent with a weak-signal regime. The strongest effects were associated with size-related features (sequence length and molecular weight; δ ≈ -0.21), whereas charge-related descriptors, particularly the proportion of negatively charged residues (δ = 0.150; AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold (|ρ| ≥ 0.85), we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings demonstrate that sequence-level solubility information is intrinsically low-dimensional and governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.
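Two of the statistics named in the abstract are simple enough to sketch in plain Python. The block below shows Cliff's delta, its standard relation to ROC-AUC (AUC = (delta + 1) / 2, which matches the reported pair delta = 0.150 and AUC = 0.575), and the Benjamini-Hochberg adjustment. The function names and the small input lists are illustrative assumptions, not the authors' code or data.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    return (greater - less) / (len(x) * len(y))


def auc_from_delta(delta):
    """ROC-AUC implied by Cliff's delta for a single feature used as the score."""
    return (delta + 1) / 2


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(round(auc_from_delta(0.150), 3))   # 0.575
print(cliffs_delta([3, 4, 5], [1, 2, 3]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

The pairwise loop in `cliffs_delta` is quadratic in the group sizes; for a dataset of the size described, a rank-based computation from the Mann-Whitney U statistic would be the practical choice.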
bioinformatics2026-05-06v3Advancing in silico drug design with Bayesian refinement of AlphaFold models
Sen, S.; Hoff, S. E.; Morozova, T. I.; Schnapka, V.; Bonomi, M.Abstract
Virtual screening has become an indispensable tool in modern structure-based drug discovery, enabling the identification of candidate molecules by computationally evaluating their potential to bind target proteins. The accuracy of such screenings critically depends on the quality of the target structures employed. Recent advances in protein structure prediction, particularly AlphaFold2, have revolutionized this field with unprecedented accuracy. However, AlphaFold2 models often exhibit limitations in local structural details, especially within binding pockets, which limit their utility for small molecule docking. In contrast, molecular dynamics simulations with accurate atomistic force fields can refine protein structures, but lack the ability to leverage the structural information provided by deep learning approaches. Here, we introduce bAIes, an integrative method that bridges this gap by combining physics-based force fields with data-driven predictions through Bayesian inference. Crucially, bAIes demonstrates a superior ability to discriminate between binders and non-binders in virtual screening campaigns, outperforming both AlphaFold2 and molecular dynamics-refined models. By enhancing the usability of AlphaFold2 models without requiring extensive experimental or computational resources, bAIes offers a convenient solution to a longstanding challenge in structure-based drug design, potentially accelerating the early phases of drug discovery.
bioinformatics2026-05-06v2PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes
Muneeb, M.; Ascher, D. B.Abstract
Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
bioinformatics2026-05-06v1