Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks
Recent advancements in large foundation models have revealed impressive capabilities in mastering complex chemical language representations. These models undergo a task-agnostic learning phase, characterized by pre-training on extensive unlabeled corpora followed by fine-tuning on specific downstrea...
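Main Authors: | Eduardo Soares, Victor Yukio Shirasuna, Emilio Vital Brazil, Karen Fiorella Aquino Gutierrez, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt, Daniel P Sanders |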
---|---|
Format: | Article |
Language: | English |
Published: | IOP Publishing, 2025-01-01 |
Series: | Machine Learning: Science and Technology |
Subjects: | foundation models; causality; molecular properties prediction; QM9 |
Online Access: | https://doi.org/10.1088/2632-2153/adabb1 |
_version_ | 1832582664120434688 |
---|---|
author | Eduardo Soares; Victor Yukio Shirasuna; Emilio Vital Brazil; Karen Fiorella Aquino Gutierrez; Renato Cerqueira; Dmitry Zubarev; Kristin Schmidt; Daniel P Sanders |
author_sort | Eduardo Soares |
collection | DOAJ |
description | Recent advancements in large foundation models have revealed impressive capabilities in mastering complex chemical language representations. These models undergo a task-agnostic learning phase, characterized by pre-training on extensive unlabeled corpora followed by fine-tuning on specific downstream tasks. This methodology reduces reliance on labeled data, facilitating data acquisition and broadening the scope of chemical language representation. However, real-world scenarios often pose challenges due to domain shift, a phenomenon where the data distribution in downstream tasks differs from that of the pre-training phase, potentially degrading model performance. To address this, we present a novel causality-based framework for feature selection and domain adaptation to enhance the performance of chemical foundation models on downstream tasks. Our approach employs a multi-stage feature selection method that identifies physico-chemical features based on their direct causal effect on specific downstream properties. By employing Mordred descriptors and Markov blanket causal graphs, our approach provides insight into the causal relationships between features and target properties for prediction tasks. We evaluate our approach on various foundation model architectures and datasets, demonstrating performance improvements that showcase the robustness and model-agnostic nature of our approach. |
format | Article |
id | doaj-art-10e4d3c4a99342959a644379e57b2010 |
institution | Kabale University |
issn | 2632-2153 |
language | English |
publishDate | 2025-01-01 |
publisher | IOP Publishing |
record_format | Article |
series | Machine Learning: Science and Technology |
spelling | doaj-art-10e4d3c4a99342959a644379e57b2010; 2025-01-29T10:42:31Z; eng; IOP Publishing; Machine Learning: Science and Technology; 2632-2153; 2025-01-01; 6; 1; 015017; 10.1088/2632-2153/adabb1; Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks; Eduardo Soares [0] (https://orcid.org/0000-0002-2634-8270); Victor Yukio Shirasuna [1]; Emilio Vital Brazil [2] (https://orcid.org/0000-0003-4982-3836); Karen Fiorella Aquino Gutierrez [3]; Renato Cerqueira [4]; Dmitry Zubarev [5]; Kristin Schmidt [6]; Daniel P Sanders [7]; [0] IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil; [1] IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil; [2] IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil; [3] IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil; [4] IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil; [5] IBM Research, San Jose, CA 95120, United States of America; [6] IBM Research, San Jose, CA 95120, United States of America; [7] IBM Research, San Jose, CA 95120, United States of America; https://doi.org/10.1088/2632-2153/adabb1; foundation models; causality; molecular properties prediction; QM9 |
title | Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks |
topic | foundation models; causality; molecular properties prediction; QM9 |
url | https://doi.org/10.1088/2632-2153/adabb1 |
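
The description in this record mentions selecting features through Markov blanket causal graphs. As a rough illustration of that concept only (not the authors' pipeline, which this record does not detail), the sketch below extracts the Markov blanket of a target node from an already known causal DAG: its parents, its children, and the other parents of its children. The graph, the node names (`descriptor_a`, `target_property`, and so on), and the function `markov_blanket` are hypothetical and chosen for illustration; in practice the graph would first have to be learned from descriptor data.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps every node to the set of its direct causes.
    The blanket is: parents of the target, children of the target,
    and the other parents of those children (co-parents).
    """
    direct_parents = set(parents.get(target, set()))
    # A child is any node that lists the target among its direct causes.
    children = {node for node, pa in parents.items() if target in pa}
    co_parents: Set[str] = set()
    for child in children:
        co_parents |= parents[child]
    blanket = direct_parents | children | co_parents
    blanket.discard(target)  # the target is never part of its own blanket
    return blanket


# Hypothetical causal graph (assumption, for illustration): node -> direct causes.
toy_graph = {
    "target_property": {"descriptor_a", "descriptor_b"},
    "descriptor_c": {"target_property", "descriptor_d"},
    "descriptor_b": {"descriptor_e"},
    "descriptor_a": set(),
    "descriptor_d": set(),
    "descriptor_e": set(),
    "descriptor_f": set(),
}

selected = sorted(markov_blanket(toy_graph, "target_property"))
print(selected)
# ['descriptor_a', 'descriptor_b', 'descriptor_c', 'descriptor_d']
```

In this toy graph, `descriptor_e` influences the target only through `descriptor_b`, and `descriptor_f` is unconnected, so neither is selected. Conditioned on its Markov blanket, the target is independent of all remaining nodes, which is why a blanket can serve as a compact feature set.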