PaleAle 6.0: Prediction of Protein Relative Solvent Accessibility by Leveraging Pre-Trained Language Models (PLMs)

Predicting the relative solvent accessibility (RSA) of a protein is critical to understanding its 3D structure and biological function. RSA prediction, especially when homology transfer cannot provide information about a protein’s structure, is a significant step toward addressing the protein struct...

Bibliographic Details
Main Authors: Wafa Alanazi, Di Meng, Gianluca Pollastri
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Biomolecules
Subjects:
Online Access: https://www.mdpi.com/2218-273X/15/1/49
collection DOAJ
description Predicting the relative solvent accessibility (RSA) of a protein is critical to understanding its 3D structure and biological function. RSA prediction, especially when homology transfer cannot provide information about a protein’s structure, is a significant step toward addressing the protein structure prediction challenge. Today, deep learning is arguably the most powerful method for predicting RSA and other structural features of proteins. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the success of NLP techniques, this study leverages pre-trained language models (PLMs) to enhance RSA prediction. We present a deep neural network architecture that combines bidirectional recurrent neural networks with convolutional layers, can analyze long-range interactions within protein sequences, and predicts protein RSA from ESM-2 encodings. The final predictor, PaleAle 6.0, predicts RSA as real values as well as in two-state (exposure threshold of 25%) and four-state (exposure thresholds of 4%, 25%, and 50%) discrete classifications. On the 2022 test set, PaleAle 6.0 achieved over 82% accuracy for two-state RSA (RSA_2C) and 59.75% accuracy for four-state RSA (RSA_4C), with a Pearson correlation coefficient (PCC) of 77.88 for real-value RSA prediction. When evaluated on the more challenging 2024 test set, PaleAle 6.0 maintained strong performance, achieving 79.74% accuracy in the two-state prediction and 55.30% accuracy in the four-state prediction, with a PCC of 73.08 for real-value predictions, outperforming all previously benchmarked predictors.
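The two-state and four-state classifications named in the description can be illustrated with a minimal Python sketch. It is not the authors' code. The thresholds (4%, 25%, 50%) come from the description; the function names, the rule that a value exactly on a threshold falls into the more exposed class, and the toy RSA values are assumptions made only for illustration.

```python
from bisect import bisect_right
from math import sqrt

# Exposure thresholds from the description, written as fractions of 1.
TWO_STATE_THRESHOLDS = [0.25]
FOUR_STATE_THRESHOLDS = [0.04, 0.25, 0.50]


def rsa_class(rsa, thresholds):
    """Map a real-valued RSA in [0, 1] to a class index (0 = most buried).

    Assumption: a value exactly equal to a threshold is placed in the
    more exposed class. The record does not state the boundary rule.
    """
    return bisect_right(thresholds, rsa)


def accuracy(predicted, observed):
    """Fraction of positions where the predicted class equals the observed one."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy per-residue values, invented for this example (not from the paper).
observed_rsa = [0.02, 0.10, 0.30, 0.80]
predicted_rsa = [0.03, 0.30, 0.35, 0.60]

for name, thresholds in [("RSA_2C", TWO_STATE_THRESHOLDS),
                         ("RSA_4C", FOUR_STATE_THRESHOLDS)]:
    obs = [rsa_class(v, thresholds) for v in observed_rsa]
    pred = [rsa_class(v, thresholds) for v in predicted_rsa]
    print(name, "accuracy:", accuracy(pred, obs))

print("PCC:", round(pearson(predicted_rsa, observed_rsa), 4))
```

A raw Pearson coefficient lies between -1 and 1, so the figures 77.88 and 73.08 in the description are presumably reported on a 0 to 100 scale.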
id doaj-art-f63b7569d6554883bbc9f747d9d5d6fd
institution Kabale University
issn 2218-273X
last_indexed 2025-01-24T13:24:59Z
volume 15
issue 1
article_number 49
doi 10.3390/biom15010049
author_affiliation School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland (all three authors)
topic protein structure prediction
structural bioinformatics
bioinformatics
natural language processing
computational biology
deep learning