iNClassSec-ESM: Discovering potential non-classical secreted proteins through a novel protein language model

Bibliographic Details
Main Authors: Yizhou Shao, Taigang Liu
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S200103702500114X
Description
Summary: Non-classical secreted proteins (NCSPs) are a class of proteins that lack signal peptides and are secreted by Gram-positive bacteria through non-classical secretion pathways. With the increasing demand for highly secreted proteins in recent years, non-classical secretion pathways have received more attention because of their advantages over the classical secretion pathways (Sec/Tat). However, because the mechanisms of non-classical secretion are not yet clear, identifying NCSPs through biological experiments is expensive and time-consuming, which makes computational methods imperative. Existing NCSP prediction methods mainly use traditional handcrafted features to represent proteins from sequence information, which limits the models' ability to capture complex protein characteristics. In this study, we propose a novel NCSP predictor, iNClassSec-ESM, which combines deep learning with traditional classifiers to enhance prediction performance. iNClassSec-ESM integrates an XGBoost model trained on comprehensive handcrafted features and a deep neural network (DNN) trained on hidden layer embeddings from the protein language model (PLM) ESM3. ESM3 is a recently proposed multimodal PLM that has not yet been fully explored for protein representation. Therefore, we extracted hidden layer embeddings from ESM3 as inputs for multiple classifiers and deep learning networks and compared them with embeddings from existing PLMs. Benchmark experiments indicate that iNClassSec-ESM outperforms most existing methods across multiple performance metrics and could serve as an effective tool for discovering potential NCSPs. Additionally, the ESM3 hidden layer embeddings, as an innovative protein representation method, show great potential for application in broader protein-related classification tasks. The source code of iNClassSec-ESM and the ESM3 embedding extraction script are publicly available at https://github.com/AmamiyaHoshie/iNClassSec-ESM/.
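The two-branch design described in the summary can be illustrated with a minimal Python sketch. The summary does not say which handcrafted features are used, how per-residue embeddings are reduced to one vector per protein, or how the two models' outputs are integrated, so everything below is an assumption made for illustration: amino acid composition stands in for a handcrafted feature, mean pooling stands in for the embedding reduction, and a weighted average of probabilities stands in for the integration step. The function names are hypothetical and do not come from the iNClassSec-ESM repository.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in `sequence`.

    Assumption: used here only as one example of a handcrafted feature.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector.

    Assumption: mean pooling is one common choice, not necessarily the paper's.
    """
    if not residue_embeddings:
        raise ValueError("no residue embeddings given")
    length = len(residue_embeddings)
    return [sum(column) / length for column in zip(*residue_embeddings)]


def fuse_probabilities(p_handcrafted, p_embedding, weight=0.5):
    """Weighted average of the two branches' NCSP probabilities.

    Assumption: soft voting is a stand-in for the unspecified integration.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_handcrafted + (1.0 - weight) * p_embedding
```

In such a setup, the output of `amino_acid_composition` would feed a tree-based classifier, the output of `mean_pool` would feed a neural network, and `fuse_probabilities` would combine the two predicted probabilities into a single score for thresholding.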
ISSN: 2001-0370