We continuously benchmark CHEESE to validate the technology and critically assess its performance against other state-of-the-art methods. Our goal is to be transparent about the advantages and limitations of the current approach, so that we can bridge the gap between scientific innovation and industrial relevance. Here we present some of the results. Note that CHEESE is still under development, and results for future versions may differ slightly.
Search speed was measured on ZINC (700M molecules)1 for 30 retrieved molecules using all metrics (including shape and electrostatic). Empirical API request speed (including fetching metadata from the database, sorting, on-the-fly property prediction, etc.) is reported in the table below for the settings "fast", "accurate" and "very accurate". For all metrics, we achieve subsecond search speed in both the "fast" and the "accurate" mode (with the exception of consensus search).
Correlations and Errors
CHEESE models are trained to approximate sophisticated similarity metrics, with further speed optimizations such as taking only SMILES as input for the shape and electrostatic embeddings (no conformers need to be computed). We therefore need to check how well distances between vectors in the CHEESE latent space correlate with ground-truth values. In the latent space of AI-generated embeddings we measure cosine similarity (which is closely related to the angular distance between vectors) and compare it with the ground-truth values of Tanimoto similarity of 2D Morgan fingerprints, Espsim and Shapesim.2 For a more detailed explanation of the metrics, see Similarity Metrics. The results are shown in the table below.
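To make the two quantities being compared concrete, here is a minimal sketch in plain Python. The vectors and bit sets are toy values invented for illustration, not CHEESE embeddings or real Morgan fingerprints, and the function names are our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

# Toy data (hypothetical): two small embeddings and two sparse fingerprints.
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 1.0, 0.0]
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7}

print("cosine similarity:", cosine_similarity(emb_a, emb_b))
print("Tanimoto similarity:", tanimoto_similarity(fp_a, fp_b))
```

The benchmark then asks how closely the first number, computed on learned embeddings, tracks the second (or Espsim/Shapesim) over many molecule pairs.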
For the 2D fingerprint model, a Pearson correlation of 98% was found on the test set, indicating robust performance. The Espsim and Shapesim models show a lower degree of correlation, 58% and 53% respectively. Similarity of Morgan fingerprints may be easier to learn than volume or electrostatic-surface overlap computed over multiple sampled conformers, which is inherently noisy. A further factor is that the model is not trained on the 3D information of the molecules, but only on the SMILES strings. Similarity prediction in the latent space (that is, computing cosine similarity) is far faster than the original shapesim and espsim (microseconds vs. seconds to minutes per pair of molecules).
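The correlation and error figures reported here are standard statistics over pairs of (predicted, ground-truth) similarities. A minimal sketch of how they can be computed follows; the numbers are made-up toy values, not results from the CHEESE test set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; assumes neither sequence is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_absolute_error(xs, ys):
    """Average absolute difference between predictions and references."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical example: latent-space cosine similarities vs. ground-truth scores.
predicted = [0.10, 0.40, 0.35, 0.80]
reference = [0.20, 0.50, 0.30, 0.90]

print("Pearson r:", pearson_r(predicted, reference))
print("MAE:", mean_absolute_error(predicted, reference))
```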
CHEESE vs. other state of the art methods
FPSim2,3 the search engine powering ChEMBL, is currently one of the fastest industry-standard packages for similarity search using traditional binary fingerprints, along with Chemfp.4 It introduces bounds for sublinear speedups, efficient compression, multicore searches and various C++ optimizations, and employs the CPU POPCNT instruction.5 Abstracting away the AI-learned embeddings in CHEESE, we can test the vector similarity search data structure on binary fingerprints to make it comparable with those traditional algorithms. Our benchmark shows that while FPSim2 is faster on smaller datasets, on larger datasets (1M+ molecules) HNSW (the vector similarity data structure used in CHEESE) is significantly faster, scaling logarithmically as opposed to linearly (or sublinearly) in FPSim2. On 100M molecules we are 40x faster. Benchmarks were run with default parameters on 1 CPU in on-disk mode for both methods.
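As background on the "bounds for sublinear speedups" used by traditional fingerprint search, the sketch below illustrates the popcount bound: the Tanimoto similarity of two bit vectors can never exceed the ratio of the smaller to the larger bit count, so many candidates can be rejected without computing the intersection. This is an illustrative plain-Python version with toy 8-bit fingerprints, not the FPSim2 implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer fingerprint."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Exact Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0

def threshold_search(query, database, threshold):
    """Return (hits, skipped): hits are (index, similarity) pairs >= threshold,
    skipped counts fingerprints rejected by the popcount bound alone."""
    q_bits = popcount(query)
    hits, skipped = [], 0
    for i, fp in enumerate(database):
        fp_bits = popcount(fp)
        lo, hi = min(q_bits, fp_bits), max(q_bits, fp_bits)
        # Upper bound: Tanimoto <= lo / hi, so prune when even the bound fails.
        if hi and lo / hi < threshold:
            skipped += 1
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((i, sim))
    return hits, skipped

# Toy 8-bit fingerprints (hypothetical data).
query = 0b11110000
database = [0b11110000, 0b11100000, 0b00000001, 0b11111111]
hits, skipped = threshold_search(query, database, threshold=0.7)
print(hits, skipped)
```

In a production setting the database would typically be sorted by bit count so that whole out-of-range blocks are skipped at once; even so, the amount of work still grows with the size of the surviving block, which is the scaling behaviour contrasted with HNSW above.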
Comprehensive benchmarks in the literature such as MssBenchmark support our claim that vector data structures are a promising alternative to traditional methods: "Our results demonstrate that the graph-based methods, such as Hnsw and Onng, consistently achieve the best trade-off between searching effectiveness and searching efficiencies."6 Ongoing research improves the efficiency and scalability of these data structures every year, driven in part by big-tech companies such as Google, Meta (Facebook), Amazon and Microsoft, which use and actively develop vector data structures suitable for ever larger datasets (applications include image search, video search, e-shop recommendation engines, advertisement optimization, text search and machine translation). Applying these emerging methods to chemical space is therefore a natural opportunity.
Smallworld from NextMove Software7 is a scalable, hardware-centric similarity search based on graph edit distance (GED), which now powers ZINC22.8 GED is a powerful metric for finding molecules with similar graphs. While it undoubtedly retrieves similar molecules, their fingerprint, shape or electrostatic similarity scores cannot be guaranteed. From our measurements, GED appears to perform significantly better than random in shape similarity and, to a lesser degree, in fingerprint (Morgan 2048 radius 2) and electrostatic similarity. The main problem we want to tackle in CHEESE is how to make the results specific to the metric we are searching for, in contrast to having a universal, database-dependent one.
Combinatorial databases are a special case of databases where the molecules are not stored explicitly but are generated from building blocks and allowed reactions. CHEESE is currently not designed for combinatorial databases, but it can be used on them provided that we enumerate the molecules. In some cases this is even necessary, since the combinatorial definition of the database is not always publicly available for proprietary reasons. Existing technologies designed for search on combinatorial databases, most notably FTrees9, SpaceLight10 and SpaceMACS11, allow very fast retrieval of synthesisable nearest neighbors based on the combinatorial rules of the database. Their advantage is almost unlimited scalability, because the search is performed "on-the-fly" by applying the combinatorial rules without the need to store the database in memory. Not all databases have a combinatorial definition; some are curated by humans instead (ZINC, PubChem, ChEMBL), and technologies designed for combinatorial search cannot be used on them. Still, such databases are not limited in scale: some of them, such as ZINC22, currently contain up to 37 billion molecules.8 Nonetheless, we see combinatorial search as a promising future direction for possible integration with CHEESE technology.
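To illustrate why enumeration is the costly step when applying an index-based method to a combinatorial space, the sketch below counts and lazily enumerates the products of a tiny made-up space. The building-block labels are placeholders and no real chemistry (reaction rules, SMILES) is modelled.

```python
from itertools import product
from math import prod

# Hypothetical reaction with three building-block slots (labels are placeholders).
reaction_slots = [
    ["amine_1", "amine_2", "amine_3"],
    ["acid_1", "acid_2"],
    ["linker_1", "linker_2", "linker_3", "linker_4"],
]

def space_size(slots):
    """Number of products: the product of the building-block counts per slot."""
    return prod(len(slot) for slot in slots)

def enumerate_products(slots):
    """Lazily yield one label per explicit product, without holding all in memory."""
    for combo in product(*slots):
        yield "+".join(combo)

size = space_size(reaction_slots)
first = next(enumerate_products(reaction_slots))
print(size, first)
```

The implicit definition here stores only 9 building blocks, while the explicit database that would need to be embedded and indexed has 24 entries; with thousands of building blocks per slot the gap grows multiplicatively.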
Is it possible to predict properties from the vector representations on which similarity search is performed? We performed a benchmark on the publicly available Therapeutics Data Commons (TDC)12 ADMET datasets, using a simple linear model (SVM) on top of concatenated fingerprint, shape and electrostatic embeddings from CHEESE. The results are shown in the table below. We are on par with the fingerprint-based models. Because the model is so simple, prediction on an indexed database with precomputed CHEESE embeddings is very fast (approximately 200 ms per 1 million molecules). More detailed benchmarks of which similarity representations are best for predicting various properties could lead to further observations and improvements in property prediction. Training embeddings tailored to particular properties may be a viable option too. For more details on the properties, see Properties.
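The reason prediction is cheap once embeddings are precomputed is that a linear model reduces to one dot product per molecule. A minimal sketch, assuming tiny toy embeddings and hand-picked weights (in practice the weights would come from fitting a linear model on a training set; none of the values below are from CHEESE):

```python
def concatenate(fingerprint_emb, shape_emb, electrostatic_emb):
    """Join the three per-molecule embeddings into one feature vector."""
    return list(fingerprint_emb) + list(shape_emb) + list(electrostatic_emb)

def linear_predict(features, weights, bias):
    """Score of a linear model: dot(weights, features) + bias."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias

# Hypothetical precomputed embeddings for one molecule.
features = concatenate([0.5, -1.0], [2.0], [0.0, 1.0])

# Hypothetical fitted parameters of the linear model.
weights = [1.0, 0.5, 0.25, 2.0, -1.0]
bias = 0.1

score = linear_predict(features, weights, bias)
print("predicted property score:", score)
```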
The industry-standard ligand-oriented virtual screening benchmark Directory of Useful Decoys, Enhanced (DUD-E)13 was used to compare CHEESE with other 3D shape and electrostatic similarity metrics, including the original shapesim and espsim. The results are shown in the table below. CHEESE is on par with other methods such as ROCS[^14-15] (color), being only slightly worse than the original espsim and shapesim metrics in terms of AUC. USR16, a fast approximate shape similarity method, is comparable to CHEESE in terms of speed but lacks the accuracy of conformer-alignment-based methods such as shapesim. Results for the original metrics are taken from the supplementary material.17 While CHEESE has only a 53-58% correlation with the original metrics, it has comparable performance in practical downstream tasks such as discriminating between active and inactive compounds, as measured by DUD-E across 102 distinct targets.
On the targets ‘comt’, ‘esr1’, ‘fabp4’, ‘fpps’, ‘mapk2’, ‘pnph’, ‘pur2’, ‘sahh’, ‘tgfr1’, ‘thb’ and ‘wee1’, CHEESE achieves a ROC AUC higher than 0.9 with some of its models. The results are shown in the following bar chart.
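For readers unfamiliar with the AUC figure used in this kind of benchmark, it can be read as the probability that a randomly chosen active is ranked above a randomly chosen decoy by its similarity score. A minimal sketch with made-up scores (not DUD-E data):

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (active, decoy) pairs ranked correctly.

    labels: 1 for active, 0 for decoy. Ties count as half a correct pair.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    correct = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                correct += 1.0
            elif a == d:
                correct += 0.5
    return correct / (len(actives) * len(decoys))

# Hypothetical similarity-to-query scores for three actives and three decoys.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print("ROC AUC:", auc)
```

The pairwise loop is quadratic and only meant to expose the definition; rank-based formulas give the same value more efficiently on large target sets.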
Pairwise similarity (sanity check)
What do molecules that are similar in shape (or electrostatics) but dissimilar in fingerprint look like? Are the similarity predictions of CHEESE intuitive? In the figure below, we show some pairs of molecules that CHEESE predicts to be shape (or electrostatic) similar and fingerprint dissimilar. Is it possible to say, for instance, that scaffold hops are shape similar but fingerprint dissimilar? This observation motivated us to work on a prototype of advanced search queries that make it possible to search based on similarity of one kind and dissimilarity of another kind. See the Advanced Search Queries prototype for more details.
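The idea of combining similarity of one kind with dissimilarity of another can be sketched as a simple filter over scored candidates. The field names, thresholds and molecule identifiers below are hypothetical and do not reflect the CHEESE API or the prototype's actual query logic.

```python
def similar_but_dissimilar(candidates, min_shape=0.8, max_fingerprint=0.3):
    """Keep candidates that are shape-similar yet fingerprint-dissimilar to the
    query, sorted by descending shape similarity."""
    hits = [
        c for c in candidates
        if c["shape_sim"] >= min_shape and c["fingerprint_sim"] <= max_fingerprint
    ]
    return sorted(hits, key=lambda c: c["shape_sim"], reverse=True)

# Hypothetical candidates with precomputed similarities to a query molecule.
candidates = [
    {"id": "mol_a", "shape_sim": 0.92, "fingerprint_sim": 0.85},  # close analogue
    {"id": "mol_b", "shape_sim": 0.88, "fingerprint_sim": 0.21},  # scaffold-hop-like
    {"id": "mol_c", "shape_sim": 0.35, "fingerprint_sim": 0.10},  # unrelated
    {"id": "mol_d", "shape_sim": 0.95, "fingerprint_sim": 0.28},  # scaffold-hop-like
]

result_ids = [c["id"] for c in similar_but_dissimilar(candidates)]
print(result_ids)
```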
Limitations of CHEESE
The strategy of utilizing vector data structures from the AI domain to explore chemical space shows considerable potential, particularly with datasets exceeding one million molecules. Its logarithmic scaling (as opposed to linear or sublinear) results in significant improvements in speed and resource efficiency, and it can be applied to traditional binary fingerprints independently. The ease of learning traditional fingerprint similarity with AI, demonstrated by a 98% correlation, acts as a crucial sanity check of the capabilities of CHEESE AI models. In the case of shape and electrostatic similarity, there are theoretical accuracy plateaus, given that the model is trained only on SMILES strings rather than on 3D representations. We can increase the dataset size and the number of model parameters for better generalization over molecular shapes and electrostatic surfaces (a noteworthy example of doing so is ESM-Fold18, an AlphaFold19 competitor from Facebook that learns 3D structure from training on 250 million protein sequences), but we still face limitations. Specifically, the model's inability to assess distinct conformers presents a challenge. To allow the model to accurately recognize multiple conformations and molecular surface flexibilities, it appears that supplying the 3D information to the model is the only viable solution, something worth considering in future CHEESE versions.
Nevertheless, results from the DUD-E benchmark reveal that despite the lesser correlations with the original metrics (53% with shapesim and 58% with espsim), CHEESE only slightly underperforms in AUC compared to the ground-truth espsim and shapesim metrics in this ligand-oriented virtual screening benchmark, while being substantially faster. It's on par with other state-of-the-art methods such as ROCS[^14-15] and outperforms similarly speedy metrics like USR16 in terms of AUC. Regarding property prediction, performance on ADMET benchmarks aligns with fingerprint-based models. We haven't identified a significant accuracy advantage with CHEESE beyond the speed.
We generally recommend using CHEESE primarily as a tool for accelerating the search and enhancing the quality of retrieved molecules, rather than as a standalone method for virtual screening that would replace traditional pipelines with resource-intensive computations like docking. This relatively novel technology combines innovative vector data structures with embeddings learned by AI and warrants further exploration to discern its true strengths. To demonstrate its scalability and efficiency, we should undertake a thorough evaluation on diverse datasets, including some of the largest available enumerated databases. The versatility of the CHEESE AI architecture should be tested on various downstream tasks, and the latent space of the embeddings should be investigated in order to build on the metric interpretability of its vector representations. For the latest updates in this area, check CHEESE Prototypes.
Citing the post
ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery. John J. Irwin, Khanh G. Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R. Wong, Munkhzul Khurelbaatar, Yurii S. Moroz, John Mayfield, and Roger A. Sayle. Journal of Chemical Information and Modeling 2020, 60 (12), 6065-6073. DOI: 10.1021/acs.jcim.0c00675
Bolcato G, Heid E, Boström J. On the Value of Using 3D Shape and Electrostatic Similarities in Deep Generative Methods. J Chem Inf Model. 2022 Mar 28;62(6):1388-1398.
Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. S. Joshua Swamidass and Pierre Baldi. Journal of Chemical Information and Modeling 2007, 47 (2), 302-317. DOI: 10.1021/ci600358f
Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8
FPSim2 GitHub Documentation: https://chembl.github.io/FPSim2/
Benchmark on Indexing Algorithms for Accelerating Molecular Similarity Search. Chun Jiang Zhu, Minghu Song, Qinqing Liu, Chloé Becquey, and Jinbo Bi. Journal of Chemical Information and Modeling 2020, 60 (12), 6167-6184. DOI: 10.1021/acs.jcim.0c00393
NextMove Software: https://www.nextmovesoftware.com/
ZINC-22─A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. Benjamin I. Tingle, Khanh G. Tang, Mar Castanon, John J. Gutierrez, Munkhzul Khurelbaatar, Chinzorig Dandarchuluun, Yurii S. Moroz, and John J. Irwin. Journal of Chemical Information and Modeling 2023, 63 (4), 1166-1176. DOI: 10.1021/acs.jcim.2c01253
Huang, K., Fu, T., Gao, W. et al. Artificial intelligence foundation for therapeutic science. Nat Chem Biol 18, 1033–1036 (2022). https://doi.org/10.1038/s41589-022-01131-2
Rarey, M., Dixon, J.S. Feature trees: A new molecular similarity measure based on tree matching. J Comput Aided Mol Des 12, 471–490 (1998). https://doi.org/10.1023/A:1008068904628
Topological Similarity Search in Large Combinatorial Fragment Spaces. Louis Bellmann, Patrick Penner, and Matthias Rarey. Journal of Chemical Information and Modeling 2021, 61 (1), 238-251. DOI: 10.1021/acs.jcim.0c00850
Maximum Common Substructure Searching in Combinatorial Make-on-Demand Compound Spaces. Robert Schmidt, Raphael Klein, and Matthias Rarey. Journal of Chemical Information and Modeling 2022, 62 (9), 2133-2150. DOI: 10.1021/acs.jcim.1c00640
Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. Michael M. Mysinger, Michael Carchia, John J. Irwin, and Brian K. Shoichet. Journal of Medicinal Chemistry 2012, 55 (14), 6582-6594. DOI: 10.1021/jm300687e
OpenEye ROCS: https://www.eyesopen.com/rocs
Comparison of Shape-Matching and Docking as Virtual Screening Tools. Paul C. D. Hawkins, A. Geoffrey Skillman, and Anthony Nicholls. Journal of Medicinal Chemistry 2007, 50 (1), 74-82. DOI: 10.1021/jm0603365
On the Value of Using 3D Shape and Electrostatic Similarities in Deep Generative Methods. Giovanni Bolcato, Esther Heid, and Jonas Boström. Journal of Chemical Information and Modeling 2022, 62 (6), 1388-1398. DOI: 10.1021/acs.jcim.1c01535
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118, e2016239118 (2021).
Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2