PaleAle 6.0: Prediction of Protein Relative Solvent Accessibility by Leveraging Pre-Trained Language Models (PLMs)

Predicting the relative solvent accessibility (RSA) of a protein is critical to understanding its 3D structure and biological function. RSA prediction, especially when homology transfer cannot provide information about a protein’s structure, is a significant step toward addressing the protein struct...

Bibliographic Details
Main Authors:	Wafa Alanazi, Di Meng, Gianluca Pollastri
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Biomolecules
Subjects:	protein structure prediction; structural bioinformatics; bioinformatics; natural language processing; computational biology; deep learning
Online Access:	https://www.mdpi.com/2218-273X/15/1/49

_version_	1832588945689411584
author	Wafa Alanazi; Di Meng; Gianluca Pollastri
author_facet	Wafa Alanazi; Di Meng; Gianluca Pollastri
author_sort	Wafa Alanazi
collection	DOAJ
description	Predicting the relative solvent accessibility (RSA) of a protein is critical to understanding its 3D structure and biological function. RSA prediction, especially when homology transfer cannot provide information about a protein’s structure, is a significant step toward addressing the protein structure prediction challenge. Today, deep learning is arguably the most powerful method for predicting RSA and other structural features of proteins. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the remarkable success of NLP techniques, this study leverages pre-trained language models (PLMs) to enhance RSA prediction. We present a deep neural network architecture based on a combination of bidirectional recurrent neural networks and convolutional layers that can analyze long-range interactions within protein sequences and predict protein RSA using ESM-2 encoding. The final predictor, PaleAle 6.0, predicts RSA in real values as well as two-state (exposure threshold of 25%) and four-state (exposure thresholds of 4%, 25%, and 50%) discrete classifications. On the 2022 test set, PaleAle 6.0 achieved over 82% accuracy for two-state RSA (RSA_2C) and 59.75% accuracy for four-state RSA (RSA_4C), with a Pearson correlation coefficient (PCC) of 77.88 for real-value RSA prediction. When evaluated on the more challenging 2024 test set, PaleAle 6.0 maintained a strong performance, achieving 79.74% accuracy in the two-state prediction and 55.30% accuracy in the four-state prediction, with a PCC of 73.08 for real-value predictions, outperforming all previously benchmarked predictors.
format	Article
id	doaj-art-f63b7569d6554883bbc9f747d9d5d6fd
institution	Kabale University
issn	2218-273X
language	English
publishDate	2025-01-01
publisher	MDPI AG
record_format	Article
series	Biomolecules
spelling	doaj-art-f63b7569d6554883bbc9f747d9d5d6fd | 2025-01-24T13:24:59Z | eng | MDPI AG | Biomolecules | 2218-273X | 2025-01-01 | 15 | 1 | 49 | 10.3390/biom15010049 | Wafa Alanazi, Di Meng, Gianluca Pollastri (all: School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland) | https://www.mdpi.com/2218-273X/15/1/49
title	PaleAle 6.0: Prediction of Protein Relative Solvent Accessibility by Leveraging Pre-Trained Language Models (PLMs)
title_sort	paleale 6 0 prediction of protein relative solvent accessibility by leveraging pre trained language models plms
topic	protein structure prediction; structural bioinformatics; bioinformatics; natural language processing; computational biology; deep learning
url	https://www.mdpi.com/2218-273X/15/1/49
work_keys_str_mv	AT wafaalanazi paleale60predictionofproteinrelativesolventaccessibilitybyleveragingpretrainedlanguagemodelsplms AT dimeng paleale60predictionofproteinrelativesolventaccessibilitybyleveragingpretrainedlanguagemodelsplms AT gianlucapollastri paleale60predictionofproteinrelativesolventaccessibilitybyleveragingpretrainedlanguagemodelsplms
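
The description above quotes exposure thresholds of 25% for two-state RSA and 4%, 25%, and 50% for four-state RSA, and reports accuracy and a Pearson correlation coefficient. The following is a minimal sketch of how such labels and scores could be computed from per-residue RSA values. It is not the PaleAle 6.0 code. The function names, the toy values, and the rule that a value exactly on a threshold falls into the more exposed class are all assumptions made for illustration.

```python
from bisect import bisect_right
from math import sqrt

# Exposure thresholds (in % RSA) as quoted in the record's description.
TWO_STATE_THRESHOLDS = (25.0,)
FOUR_STATE_THRESHOLDS = (4.0, 25.0, 50.0)


def rsa_class(rsa_percent, thresholds):
    """Return the class index for one RSA value (0 = most buried).

    Assumption: a value exactly equal to a threshold is assigned
    to the more exposed class.
    """
    return bisect_right(thresholds, rsa_percent)


def accuracy(predicted, observed):
    """Percentage of positions where the two label lists agree."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def pearson(xs, ys):
    """Pearson correlation coefficient on the usual -1..1 scale."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy per-residue RSA values (hypothetical, not taken from the paper).
observed_rsa = [2.0, 10.0, 30.0, 70.0]
predicted_rsa = [5.0, 12.0, 20.0, 65.0]

for name, thresholds in (("RSA_2C", TWO_STATE_THRESHOLDS),
                         ("RSA_4C", FOUR_STATE_THRESHOLDS)):
    obs = [rsa_class(v, thresholds) for v in observed_rsa]
    pred = [rsa_class(v, thresholds) for v in predicted_rsa]
    print(name, "accuracy (%):", accuracy(pred, obs))

print("PCC:", pearson(predicted_rsa, observed_rsa))
```

The PCC figures in the description (77.88 and 73.08) appear to be on a 0 to 100 scale; if so, the value returned by `pearson` would need multiplying by 100 to compare.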

Similar Items