
Deep Sequence Model for Genome Wide Discovery of Coding and Regulatory Element Signatures

Abstract

Interpreting deep neural networks for genomic sequence classification remains challenging despite their strong predictive performance. We develop a lightweight CNN and apply post-training gradient-based analysis to identify which sequence positions drive coding versus intergenomic classification. Applied to the standardized demo coding vs. intergenomic dataset, our model achieves 91.7% validation accuracy while revealing interpretable patterns: nine hot spots with consistently high importance, clear class-specific separation at positions 20-100 (coding) and 150-190 (intergenomic), and a strong mean-variance correlation (r = 0.530) indicating robust discriminative features. Gradient-based importance analysis shows that the model implicitly learns biologically meaningful sequence distinctions without explicit annotations. This work demonstrates that neural network interpretability and accuracy can coexist, providing a framework for understanding genomic sequence classification and enabling biology-driven hypothesis generation.
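As a rough illustration of the gradient-based position importance described in the abstract, the sketch below computes input gradients for a toy one-layer CNN in NumPy and then correlates the per-position mean and variance of importance across sequences. It is not the authors' model: the architecture (convolution, ReLU, mean pooling, linear logit), the random weights, the sequence length of 200, and all function names are assumptions made for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        j = ALPHABET.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def forward(x, W, v, b):
    """Toy CNN (assumed): valid 1D convolution -> ReLU -> mean pool -> linear logit."""
    K = W.shape[0]
    T = x.shape[0] - K + 1
    windows = np.stack([x[t:t + K] for t in range(T)])   # (T, K, 4)
    pre = np.einsum("tkc,kcf->tf", windows, W)           # pre-activations, (T, F)
    logit = np.maximum(pre, 0.0).mean(axis=0) @ v + b
    return logit, pre

def input_gradient(x, W, v, b):
    """Analytic gradient of the logit with respect to the input array."""
    _, pre = forward(x, W, v, b)
    T, K = pre.shape[0], W.shape[0]
    d_pre = (pre > 0) * (v / T)                          # backprop through pool and ReLU
    grad = np.zeros_like(x)
    for t in range(T):
        # Each window position t touches input rows t .. t+K-1.
        grad[t:t + K] += np.einsum("f,kcf->kc", d_pre[t], W)
    return grad

def position_importance(x, W, v, b):
    """Per-position importance: absolute gradient summed over the four bases."""
    return np.abs(input_gradient(x, W, v, b)).sum(axis=1)

# Hypothetical demo: random weights and random sequences, not trained or real data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 16))    # kernel width 8, 16 filters (assumed)
v = rng.normal(size=16)
b = 0.0
seqs = ["".join(rng.choice(list(ALPHABET), size=200)) for _ in range(32)]
imp = np.stack([position_importance(one_hot(s), W, v, b) for s in seqs])  # (32, 200)

mean_imp = imp.mean(axis=0)        # average importance per position
var_imp = imp.var(axis=0)          # variability of importance per position
r = np.corrcoef(mean_imp, var_imp)[0, 1]
print("top positions by mean importance:", np.argsort(mean_imp)[-9:][::-1])
print("mean-variance correlation r =", round(float(r), 3))
```

With a trained classifier, the same per-position averages could be computed separately for each class to look for class-specific regions; with the random weights used here the numbers carry no biological meaning.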