Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks


Bibliographic Details
Main Authors: Eduardo Soares, Victor Yukio Shirasuna, Emilio Vital Brazil, Karen Fiorella Aquino Gutierrez, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt, Daniel P Sanders
Format: Article
Language: English
Published: IOP Publishing 2025-01-01
Series: Machine Learning: Science and Technology
Online Access: https://doi.org/10.1088/2632-2153/adabb1
ISSN: 2632-2153
Volume: 6
Issue: 1
Article Number: 015017
Subjects: foundation models; causality; molecular properties prediction; QM9
Collection: DOAJ
Institution: Kabale University
Record ID: doaj-art-10e4d3c4a99342959a644379e57b2010

Author Details:
Eduardo Soares (https://orcid.org/0000-0002-2634-8270), IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil
Victor Yukio Shirasuna, IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil
Emilio Vital Brazil (https://orcid.org/0000-0003-4982-3836), IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil
Karen Fiorella Aquino Gutierrez, IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil
Renato Cerqueira, IBM Research, Rio de Janeiro, Rio de Janeiro 20031-170, Brazil
Dmitry Zubarev, IBM Research, San Jose, CA 95120, United States of America
Kristin Schmidt, IBM Research, San Jose, CA 95120, United States of America
Daniel P Sanders, IBM Research, San Jose, CA 95120, United States of America

Description:
Recent advancements in large foundation models have revealed impressive capabilities in mastering complex chemical language representations. These models undergo a task-agnostic learning phase, characterized by pre-training on extensive unlabeled corpora followed by fine-tuning on specific downstream tasks. This methodology reduces reliance on labeled data, facilitating data acquisition and broadening the scope of chemical language representation. However, real-world scenarios often pose challenges due to domain shift, a phenomenon where the data distribution in downstream tasks differs from that of the pre-training phase, potentially degrading model performance. To address this, we present a novel causal-based framework for feature selection and domain adaptation to enhance the performance of chemical foundation models on downstream tasks. Our approach employs a multi-stage feature selection method that identifies physico-chemical features based on their direct causal-effect over specific downstream properties. By employing Mordred descriptors and Markov blanket causal graphs, our approach provides insight into the causal relationships between features and target properties for prediction tasks. We evaluate our approach on various foundation model architectures and datasets, demonstrating performance improvements, which showcases the robustness and the agnostic nature of our approach.