Matched pairs demonstrate robustness against inter-assay variability

Abstract Machine learning models for chemistry require large datasets, often compiled by combining data from multiple assays. However, combining data without careful curation can introduce significant noise. While absolute values from different assays are rarely comparable, trends or differences bet...

Bibliographic Details
Main Authors: Jochem Nelen, Horacio Pérez-Sánchez, Hans De Winter, Dries Van Rompaey
Format: Article
Language: English
Published: BMC 2025-01-01
Series: Journal of Cheminformatics
Subjects: Matched structural pairs; Assay noise; Data curation; ChEMBL; Machine learning
Online Access: https://doi.org/10.1186/s13321-025-00956-y
collection DOAJ
description Abstract: Machine learning models for chemistry require large datasets, often compiled by combining data from multiple assays. However, combining data without careful curation can introduce significant noise. While absolute values from different assays are rarely comparable, trends or differences between compounds are often assumed to be consistent. This study evaluates that assumption by analyzing potency differences between matched compound pairs across assays and assessing the impact of assay metadata curation on error reduction. We find that potency differences between matched pairs exhibit less variability than individual compound measurements, suggesting that systematic assay differences may partially cancel out in paired data. Metadata curation further improves inter-assay agreement, albeit at the cost of dataset size. For minimally curated compound pairs, agreement within 0.3 pChEMBL units was found to be 44–46% for Ki and IC50 values, respectively, which improved to 66–79% after curation. Similarly, the percentage of pairs with differences exceeding 1 pChEMBL unit dropped from 12–15% to 6–8% with extensive curation. These results establish a benchmark for expected noise in matched molecular pair data from the ChEMBL database, offering practical metrics for data quality assessment.
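The two metrics quoted in the abstract (share of pairs agreeing within 0.3 pChEMBL units, share differing by more than 1 unit) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy pChEMBL values, the compound labels, and the choice to treat "within 0.3" as inclusive are all assumptions made for illustration, and the matched pairs are supplied by hand rather than derived from structures.

```python
def pair_delta_disagreement(assay_1, assay_2, pairs):
    """For each (a, b) pair, compare the potency difference a - b across two assays.

    assay_1 / assay_2 map compound label -> pChEMBL value (hypothetical data).
    Returns {pair: |delta_in_assay_1 - delta_in_assay_2|}.
    """
    result = {}
    for a, b in pairs:
        delta_1 = assay_1[a] - assay_1[b]
        delta_2 = assay_2[a] - assay_2[b]
        result[(a, b)] = abs(delta_1 - delta_2)
    return result


def summarize(disagreements, tight=0.3, loose=1.0):
    """Fraction of pairs agreeing within `tight` and fraction differing by more than `loose`."""
    values = list(disagreements.values())
    if not values:
        raise ValueError("no pairs to summarize")
    within = sum(v <= tight for v in values) / len(values)  # inclusive bound is an assumption
    beyond = sum(v > loose for v in values) / len(values)
    return within, beyond


# Toy values: assay_2 reads cpd_A and cpd_B 0.5 units higher than assay_1,
# so their individual values disagree but their difference does not.
assay_1 = {"cpd_A": 7.0, "cpd_B": 6.0, "cpd_C": 8.0}
assay_2 = {"cpd_A": 7.5, "cpd_B": 6.5, "cpd_C": 7.0}
pairs = [("cpd_A", "cpd_B"), ("cpd_A", "cpd_C"), ("cpd_B", "cpd_C")]

disagreements = pair_delta_disagreement(assay_1, assay_2, pairs)
within, beyond = summarize(disagreements)
print(disagreements)
print(f"within 0.3: {within:.2f}, beyond 1.0: {beyond:.2f}")
```

In the toy data the constant 0.5-unit offset on cpd_A and cpd_B cancels in their pair difference, which is the effect the abstract describes; the pairs involving cpd_C do not benefit because its shift between assays is not shared.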
id doaj-art-aafb622c6d904906a388da1a4337bd31
institution Kabale University
issn 1758-2946
spelling doaj-art-aafb622c6d904906a388da1a4337bd31; 2025-01-26T12:50:04Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2025-01-01; 17118; 10.1186/s13321-025-00956-y
Jochem Nelen (Structural Bioinformatics and High Performance Computing Research Group (BIO-HPC), HiTech Innovation Hub, UCAM Universidad Católica de Murcia)
Horacio Pérez-Sánchez (Structural Bioinformatics and High Performance Computing Research Group (BIO-HPC), HiTech Innovation Hub, UCAM Universidad Católica de Murcia)
Hans De Winter (Department of Pharmaceutical Sciences, Faculty of Pharmaceutical, Biomedical and Veterinary Sciences, University of Antwerp)
Dries Van Rompaey (Drug Discovery Data Sciences, Janssen Pharmaceutica NV)
topic Matched structural pairs
Assay noise
Data curation
ChEMBL
Machine learning