SMILES all around: structure to SMILES conversion for transition metal complexes

Abstract We present a method for creating RDKit-parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which ma...

Bibliographic Details
Main Authors: Maria H. Rasmussen, Magnus Strandgaard, Julius Seumer, Laura K. Hemmingsen, Angelo Frei, David Balcells, Jan H. Jensen
Format: Article
Language: English
Published: BMC 2025-04-01
Series: Journal of Cheminformatics
Online Access: https://doi.org/10.1186/s13321-025-01008-1
collection DOAJ
description Abstract We present a method for creating RDKit-parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and the overall charge of the complex. This can be viewed as an extension of the program xyz2mol, which does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. What has been lacking when generating SMILES from structure for TMCs is an existing SMILES dataset to compare with, so sanity-checking a method has required manual work. We therefore also generate SMILES in two other ways. In the first, ligand charges and TMC connectivity are based on natural bond orbital (NBO) analysis from density functional theory (DFT) calculations, utilizing recent work by Kneiding et al. (Digit Discov 2: 618–633, 2023). The second fixes SMILES available through the Cambridge Structural Database (CSD), making them parsable by RDKit. We compare these three ways of obtaining SMILES for a subset of the CSD (tmQMg) and find >70% agreement for all three pairs. We use these SMILES to make simple molecular fingerprint (FP) and graph-based representations of the molecules for use in machine learning. Comparing with the graphs made by Kneiding et al., where nodes and edges are featurized with DFT properties, we find that, depending on the target property (polarizability, HOMO-LUMO gap or dipole moment), the SMILES-based representations can perform equally well. This makes them very suitable as baseline models. Finally, we present a dataset of 227k RDKit-parsable SMILES for mononuclear TMCs in the CSD.
Scientific contribution We present a method that can create RDKit-parsable SMILES strings of transition metal complexes (TMCs) from Cartesian coordinates and use it to create a dataset of 227k TMC SMILES strings. The RDKit-parsability allows us to perform machine learning studies of TMC properties using "standard" molecular representations such as fingerprints and 2D-graph convolution. We show that these relatively simple representations can perform quite well depending on the target property.
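The pairwise comparison mentioned in the abstract (agreement between three ways of obtaining SMILES) can be illustrated with a minimal sketch. It assumes that "agreement" means an exact string match between already-canonicalized SMILES for the same entry, which the record does not state. The function name `pairwise_agreement`, the source labels, the entry IDs and the placeholder strings are all hypothetical and are not taken from the article or its data.

```python
from itertools import combinations


def pairwise_agreement(sources):
    """Fraction of shared entries with identical SMILES, for each pair of sources.

    `sources` maps a source label to a dict of {entry_id: canonical_smiles}.
    Assumption: the strings were canonicalized beforehand, so plain string
    equality is a meaningful test. Pairs with no shared entries map to None.
    """
    result = {}
    for a, b in combinations(sorted(sources), 2):
        shared = sources[a].keys() & sources[b].keys()
        if not shared:
            result[(a, b)] = None
            continue
        same = sum(sources[a][k] == sources[b][k] for k in shared)
        result[(a, b)] = same / len(shared)
    return result


# Hypothetical toy data: placeholder strings stand in for real SMILES.
sources = {
    "xyz": {"ID1": "SMI_A", "ID2": "SMI_B", "ID3": "SMI_C", "ID4": "SMI_D"},
    "nbo": {"ID1": "SMI_A", "ID2": "SMI_B", "ID3": "SMI_X", "ID4": "SMI_D"},
    "csd": {"ID1": "SMI_A", "ID2": "SMI_Y", "ID3": "SMI_C", "ID4": "SMI_D"},
}

agreement = pairwise_agreement(sources)
for pair, fraction in agreement.items():
    print(pair, fraction)
```

Restricting each comparison to the entries both sources cover keeps a missing SMILES in one source from being counted as a disagreement. Whether the article handles missing entries this way is not stated in this record.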
format Article
id doaj-art-e5fcb51dc35346d1bf16773fa9c6bcfc
institution Kabale University
issn 1758-2946
language English
publishDate 2025-04-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj-art-e5fcb51dc35346d1bf16773fa9c6bcfc; 2025-08-20T03:52:19Z; eng; BMC; Journal of Cheminformatics; ISSN 1758-2946; 2025-04-01; volume 17, issue 1, pages 1-13; DOI 10.1186/s13321-025-01008-1
author affiliations
Maria H. Rasmussen: Department of Chemistry, University of Copenhagen
Magnus Strandgaard: Department of Chemistry, University of Copenhagen
Julius Seumer: Department of Chemistry, University of Copenhagen
Laura K. Hemmingsen: Department of Chemistry, University of Copenhagen
Angelo Frei: Department of Chemistry, University of York
David Balcells: Department of Chemistry, University of Oslo
Jan H. Jensen: Department of Chemistry, University of Copenhagen