SORFPP

Enhancing In-depth and Rich Sequence-driven Information to Identify LncRNA Encoded Peptide Based on A Fused Computed Framework on Experiment Validation Datasets


Introduction:

Background: With advances in genomic sequencing technologies, functional peptides encoded by short open reading frames (sORFs) within long non-coding RNAs (lncRNAs), referred to as SEPs, have been recognized for their crucial roles in fundamental cellular processes and diseases. Because SEPs have a small molecular weight that sets them apart from conventional peptides, and because dedicated, comprehensive data resources are lacking, predicting SEPs is challenging. There is therefore an urgent need for a comprehensive dataset and for research that specifically addresses the characteristics of short peptides.

Results: This paper introduces SORFPP, a fused computational framework that predicts SEPs from multiple perspectives based on an experimentally validated dataset, TranLnc. Because traditional peptide feature encodings have limitations when applied to SEPs, nucleotide-level features were incorporated to enhance the encoding. The ESM-1b model was employed to extract deeper information from SEP sequences. Given the differences between the two encodings, traditional features were used with CatBoost, while deep features were fed to a self-attention model. Results from the traditional and deep methods were then integrated to achieve accurate and stable predictions. In terms of the Matthews correlation coefficient (MCC), SORFPP significantly outperformed leading models developed in the past five years across the three datasets, with MCC increasing by 12.2%-24.2%.
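The integration step and the MCC metric mentioned above can be illustrated with a minimal sketch. It assumes the two base models each output a positive-class probability per sequence and that the outputs are combined by a weighted average; the source does not state the actual fusion rule, so `fuse_probabilities`, the `weight` parameter, the 0.5 threshold, and the toy scores are all hypothetical and stand in for whatever SORFPP really uses.

```python
import math


def fuse_probabilities(p_traditional, p_deep, weight=0.5):
    """Weighted average of two models' positive-class probabilities.

    Assumption: simple late fusion; not necessarily the rule used by SORFPP.
    """
    if len(p_traditional) != len(p_deep):
        raise ValueError("both models must score the same samples")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(p_traditional, p_deep)]


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (0/1), computed from the confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def to_labels(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 predictions at a fixed threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up scores for six sequences (three positives, three negatives).
y_true = [1, 1, 1, 0, 0, 0]
p_catboost = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]   # hypothetical traditional-feature model
p_attention = [0.8, 0.7, 0.4, 0.3, 0.2, 0.4]  # hypothetical deep-feature model

fused = fuse_probabilities(p_catboost, p_attention)
y_pred = to_labels(fused)

print("MCC, traditional only:", matthews_corrcoef(y_true, to_labels(p_catboost)))
print("MCC, deep only:       ", matthews_corrcoef(y_true, to_labels(p_attention)))
print("MCC, fused:           ", matthews_corrcoef(y_true, y_pred))
```

In this toy case each base model misclassifies a different sequence, so averaging corrects both errors. That is the general motivation for fusing complementary models, not evidence about SORFPP's reported numbers.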

Conclusion: Experimental comparisons show that integrating nucleotide and deep encoding features has a positive impact. In addition, using an ensemble learning framework to fuse the results of the base models improves predictive performance. The experiments indicate that SORFPP is a reliable method for identifying the activity of SEPs.