AXXAM

Science Spyglass

Building AXXVirtual: a chemistry-driven virtual library for drug discovery

As in high-throughput screening (HTS), the quality of the screened library plays a crucial role in the success of virtual screening. While virtual screening enables the exploration of much broader and more diverse chemical spaces, many virtual libraries are populated with molecules that, while computationally attractive, are difficult or even impossible to synthesize. For many drug discovery programs, synthesis therefore represents a major bottleneck: it can be slow, unpredictable, and resource-intensive, delaying both the confirmation of biological activity and the chemical exploration of promising compounds.

This is precisely the gap that the AXXVirtual library was designed to close. Beyond ensuring high-quality, drug-like chemical space, this library of 185 million non-commercial small molecules was built with synthetic feasibility at its core.

Every AXXVirtual compound can be produced in just 2–3 steps from readily available building blocks. This unique design guarantees that virtual hits are not just theoretical possibilities but tangible molecules, accessible within controlled and predictable timelines. As a result, AXXVirtual enables researchers to reach in vitro confirmation faster, accelerating the path from virtual screening to validated hits.

The library was developed through a structured four-stage process in which compounds were rigorously selected by applying strict rules and filters to guarantee drug-like properties and structural diversity, thereby enabling efficient downstream development.

This article walks you through the principles behind building a high-quality virtual library for drug discovery and shows how these concepts were applied in the design of the 185-million-compound AXXVirtual library.

Designing for the real lab: synthetic accessibility

Synthetic feasibility has become a key parameter in the design of virtual libraries, ensuring that computational efforts translate into compounds that can be readily produced for downstream testing [1, 2]. The core of this approach lies in synthetic routes based on established reaction classes that have demonstrated their value over decades, such as amide coupling and the Suzuki–Miyaura reaction. Despite the emergence of new methodologies, these reactions remain the backbone of medicinal chemistry due to their efficiency, reproducibility, scalability, and high yields [3].

Building on these concepts, the AXXVirtual compounds have been designed to be synthesized through twelve synthetic routes, each consisting of two to three steps and together employing nine reliable reactions. The building blocks, more than 12,000 in total, were selected from the inventory of a trusted partner and are immediately available, eliminating delays from external orders. In addition, the reagents have been carefully selected to ensure clean reactions, minimizing side products and regioisomer formation.
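To illustrate why a modest set of building blocks can yield a very large library, the following minimal sketch enumerates the products of a combinatorial route by pairing every building block at one position with every block at the next. The block identifiers, the counts, and the function names `enumerate_route` and `route_size` are hypothetical and are not taken from the AXXVirtual design; a real enumeration would also check that each combination is chemically compatible with the reaction.

```python
from itertools import product
from math import prod


def enumerate_route(*block_lists):
    """Return every combination of one building block per position."""
    return list(product(*block_lists))


def route_size(block_counts):
    """Number of virtual products for a route, given block counts per position."""
    return prod(block_counts)


# Hypothetical building-block identifiers (placeholders, not a real inventory).
acids = ["acid_A", "acid_B", "acid_C"]
amines = ["amine_1", "amine_2"]

# A two-component coupling pairs every acid with every amine.
pairs = enumerate_route(acids, amines)
print(len(pairs))  # 3 x 2 = 6 virtual products

# Hypothetical counts for a three-component route.
print(route_size([500, 400, 300]))  # 60000000
```

Because the product count is multiplicative across positions, a few hundred compatible reagents per position are already enough to reach tens of millions of virtual compounds for a single route.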

This thoughtful combination of proven chemistry and readily available reagents enables a fast and efficient synthesis, allowing the preparation of 100–120 compounds within just two to three weeks. To maintain these standards, the library is regularly updated in line with the partner’s inventory, making AXXVirtual a dynamic and continuously evolving library.

Designing smarter: AI-powered property and synthetic feasibility prediction

Artificial intelligence (AI) is playing an increasingly important role in the landscape of virtual libraries for drug discovery by enabling more accurate and efficient predictions of molecular properties and synthetic accessibility. Machine learning models rely heavily on large, curated training sets (databases of molecules with known experimental properties) to identify patterns and relationships between molecular features, such as size, chemical groups, and shape, and observed behaviors, such as solubility and toxicity. Unlike simple rule-based methods, machine learning adapts to the complexity and variability inherent in chemical data, which allows it to capture subtle influences and nonlinear effects that traditional rules often miss.

For synthetic accessibility, tools like RAscore (Retrosynthetic Accessibility Score) [8] are widely used. RAscore is a machine learning classifier trained on the outcomes of the retrosynthetic planning software AiZynthFinder. Instead of running a full retrosynthetic analysis for each molecule — which is impractical when dealing with millions of compounds — RAscore provides a rapid estimate of whether a compound is likely to be synthesizable using known building blocks and reaction rules.

Applying RAscore to evaluate AXXVirtual compounds, we found that the vast majority (96%) scored above 0.8 on the 0-to-1 scale, confirming their high synthetic accessibility. This result further highlights the robustness of the chemistry underpinning our library.
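The summary statistic reported above is simply the share of compounds whose score lies above a cutoff. A minimal sketch of that calculation follows; the scores are made-up placeholders rather than AXXVirtual data, the function name `fraction_above` is hypothetical, and computing the scores themselves (by running the RAscore model) is assumed to have happened elsewhere.

```python
def fraction_above(scores, threshold=0.8):
    """Share of scores strictly above the threshold (scores on a 0-to-1 scale)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score > threshold for score in scores) / len(scores)


# Placeholder scores for four hypothetical compounds.
example_scores = [0.97, 0.91, 0.42, 0.88]
print(fraction_above(example_scores))  # 3 of 4 scores exceed 0.8 -> 0.75
```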

Designing for success: from synthesizable to developable molecules

While synthetic accessibility defines what can be built, drug-likeness defines what is worth pursuing. Virtual libraries should contain compounds that are not only synthetically feasible but also have molecular properties that make them suitable candidates for future development. These include properties that affect solubility, permeability, metabolic stability, and safety.

The concept of drug-likeness is grounded in the empirical observation of properties shared by orally bioavailable drugs. Large-scale analyses of marketed drugs and clinical candidates have revealed that certain molecular properties – such as moderate size and balanced lipophilicity – are associated with favorable pharmacokinetic behavior. These findings led to the formulation of guidelines, with Lipinski’s Rule of Five (Ro5) [4] and Veber’s rules [5] being among the most well-known and widely adopted.
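As a rough illustration of how such guidelines become a filter, the sketch below checks precomputed descriptors against the commonly cited Ro5 limits (molecular weight ≤ 500, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and Veber's criteria (at most 10 rotatable bonds, polar surface area ≤ 140 Å²). The `Descriptors` class, the function names, the example values, and the choice to tolerate one Ro5 violation are assumptions made for this example; they do not describe the specific filters applied to AXXVirtual.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one molecule (values supplied by the caller)."""
    mol_weight: float          # g/mol
    logp: float                # octanol-water partition coefficient
    h_donors: int
    h_acceptors: int
    rotatable_bonds: int
    polar_surface_area: float  # in square angstroms


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Rule-of-Five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_veber(d: Descriptors) -> bool:
    """Veber's criteria: limited flexibility and limited polar surface area."""
    return d.rotatable_bonds <= 10 and d.polar_surface_area <= 140


def is_drug_like(d: Descriptors, max_ro5_violations: int = 1) -> bool:
    """Combine both guidelines; the tolerated violation count is an assumption."""
    return ro5_violations(d) <= max_ro5_violations and passes_veber(d)


# Hypothetical descriptor values for two made-up molecules.
compact = Descriptors(350.4, 2.8, 2, 5, 4, 85.0)
bulky = Descriptors(720.9, 6.3, 4, 9, 14, 155.0)

print(is_drug_like(compact))  # True: no Ro5 violations, Veber criteria met
print(is_drug_like(bulky))    # False: two Ro5 violations, Veber criteria failed
```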

In parallel with physicochemical profiling, the quality of chemical libraries, including virtual ones, must be ensured by excluding compounds known to cause assay interference or unreliable readouts. A major class of such problematic molecules is represented by PAINS (Pan-Assay Interference compoundS), which are chemical structures prone to react nonspecifically with numerous biological targets rather than specifically affecting one desired target [6].

Rhodanines exemplify the extent of the problem. More than 2,000 rhodanines have been reported to have biological activity in over 400 papers. However, a publication by Bristol-Myers Squibb points out that these compounds undergo light-induced reactions that irreversibly modify proteins. It is hard to imagine how such a mechanism could be optimized to produce a drug or a useful tool [7].

At Axxam, more than 20 years of experience with physical libraries have given us deep insight into selecting the right compounds suitable for drug discovery campaigns. Our physical collections are curated using rigorous filters that remove PAINS, toxicophores, and compounds with poor drug-likeness profiles, ensuring that they contain high-quality, developable compounds. By applying the same meticulous approach that has guided the success of our physical collections to AXXVirtual, we have ensured that its virtual compounds are not only readily synthesizable but also carefully chosen for downstream development.

Designing with diversity: navigating chemical space

Clustering generated compounds is crucial for creating a virtual library that explores a broad chemical space and avoids redundancy. A chemically diverse library better represents the breadth of chemical space, maximizing the chance of discovering hit compounds.

The Tanimoto coefficient is probably the most common metric used to measure similarity between two molecules based on their fingerprints, which are bit strings that encode the presence or the absence of specific structural features. While effective for small datasets, calculating and storing a full Tanimoto similarity matrix across millions of compounds becomes computationally infeasible due to its quadratic growth in size.
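For two fingerprints with on-bit sets A and B, the Tanimoto coefficient is |A ∩ B| / |A ∪ B|. The sketch below computes it on Python sets of on-bit indices and also counts the unique pairs needed for a full similarity matrix, which makes the quadratic growth concrete. The function names and example bit indices are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def pair_count(n_molecules):
    """Number of unique molecule pairs in a full similarity matrix."""
    return n_molecules * (n_molecules - 1) // 2


# Two made-up fingerprints sharing 2 of 6 distinct on-bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
print(tanimoto(fp_a, fp_b))  # 2 / 6, about 0.33

# Pairs needed for a library of 185 million compounds.
print(pair_count(185_000_000))  # 17112499907500000
```

At roughly 1.7 × 10^16 pairs, even a single byte per similarity value would require on the order of 17 petabytes of storage.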

This limitation has led to the development of more scalable approaches, which reduce computational demands by clustering compounds without comparing every possible pair.

One such method is the leader clustering algorithm, which processes molecules sequentially and assigns them to the first cluster whose representative (the “leader”) exceeds a given similarity threshold. If no suitable cluster is found, a new one is created. Another powerful example is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). BIRCH incrementally builds a hierarchical tree structure that summarizes the dataset and enables efficient clustering by compressing information early in the process.
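The leader procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for AXXVirtual: fingerprints are represented as sets of on-bit indices, similarity is the Tanimoto coefficient, a molecule joins a cluster when its similarity to the leader meets or exceeds the threshold, and the first member of each cluster serves as its leader. The function names, the 0.5 threshold, and the example fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def leader_cluster(fingerprints, threshold):
    """Assign each fingerprint to the first cluster whose leader is similar enough.

    Returns a list of clusters, each a list of input indices; the first index
    in every cluster is that cluster's leader.
    """
    clusters = []
    for index, fingerprint in enumerate(fingerprints):
        for members in clusters:
            leader = fingerprints[members[0]]
            if tanimoto(fingerprint, leader) >= threshold:
                members.append(index)
                break
        else:
            # No existing leader was similar enough: start a new cluster.
            clusters.append([index])
    return clusters


# Made-up fingerprints forming two obvious groups.
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {1, 2, 3, 4, 5},
]
print(leader_cluster(fingerprints, threshold=0.5))  # [[0, 1, 4], [2, 3]]
```

Each molecule is compared only with the current leaders, so the cost scales with the number of clusters rather than with the number of molecule pairs. The result does depend on the order in which molecules are processed.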

Figure: Clustering generated virtual compounds

Finally, to explore and interpret chemical diversity intuitively, 2D visualization techniques play a crucial role. Methods such as t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) project high-dimensional molecular representations, such as fingerprints, into an easily interpretable two-dimensional space. These visualizations make clusters and outliers easy to identify by eye.

To assess chemical diversity, a t-SNE analysis was performed comparing a one-million-compound subset of AXXVirtual with our physical library AXXDiversity. As shown in the figure, the results revealed minimal overlap between the chemical spaces explored by AXXVirtual (green) and AXXDiversity (blue). This complementary relationship demonstrates that AXXVirtual not only reflects the robustness of the physical collection but also expands its scope, extending the chemical space accessible for drug discovery.

Conclusion

The landscape of drug discovery is rapidly being reshaped by advances in computational chemistry and artificial intelligence. In this context, the development of high-quality virtual libraries has become a strategic cornerstone.

Designing an effective virtual library for drug discovery is not simply about assembling a collection of compounds; it requires a deliberate balance between synthetic feasibility, drug-likeness, and chemical diversity. By integrating these elements thoughtfully, virtual libraries can significantly accelerate the drug discovery process, transforming computational predictions into real-world therapeutic opportunities.

AXXVirtual exemplifies this integrated approach. Built upon robust synthetic routes and carefully selected reagents, it provides a collection of compounds that are readily synthesizable. The immediate availability of reagents enables the rapid synthesis of the most promising compounds after the virtual screening. Moreover, rigorous filters for drug-likeness were applied to ensure that molecules identified as hits can be further developed. A sample analysis shows that 99.8% of the compounds are not commercially available. Together with the t-SNE analysis results showing minimal overlap with AXXDiversity, this confirms that AXXVirtual explores a complementary and largely untapped chemical space.

  1. Nicolaou, C. A.; Watson, I. A.; Hu, H.; Wang, J. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. J. Chem. Inf. Model. 2016, 56 (7), 1253–1266. https://doi.org/10.1021/acs.jcim.6b00173.
  2. Hu, Q.; Peng, Z.; Sutton, S. C.; Na, J.; Kostrowicki, J.; Yang, B.; Thacher, T.; Kong, X.; Mattaparti, S.; Zhou, J. Z.; Gonzalez, J.; Ramirez-Weinhouse, M.; Kuki, A. Pfizer Global Virtual Library (PGVL): A Chemistry Design Tool Powered by Experimentally Validated Parallel Synthesis Information. ACS Comb. Sci. 2012, 14 (11), 579–589. https://doi.org/10.1021/co300096q.
  3. Brown, D. G.; Boström, J. Analysis of Past and Present Synthetic Methodologies on Medicinal Chemistry: Where Have All the New Reactions Gone? J. Med. Chem. 2016, 59 (10), 4443–4458. https://doi.org/10.1021/acs.jmedchem.5b01409.
  4. Lipinski, C. A. Lead- and Drug-Like Compounds: The Rule-of-Five Revolution. Drug Discov. Today Technol. 2004, 1 (4), 337–341. https://doi.org/10.1016/j.ddtec.2004.11.007.
  5. Veber, D. F.; Johnson, S. R.; Cheng, H.-Y.; Smith, B. R.; Ward, K. W.; Kopple, K. D. Molecular Properties That Influence the Oral Bioavailability of Drug Candidates. J. Med. Chem. 2002, 45 (12), 2615–2623. https://doi.org/10.1021/jm020017n.
  6. Baell, J.; Walters, M. A. Chemistry: Chemical Con Artists Foil Drug Discovery. Nature 2014, 513 (7519), 481–483. https://doi.org/10.1038/513481a.
  7. Voss, M. E.; Carter, P. H.; Tebben, A. J.; Scherle, P. A.; Brown, G. D.; Thompson, L. A.; Xu, M.; Lo, Y. C.; Yang, G.; Liu, R.-Q.; Strzemienski, P.; Everlof, J. G.; Trzaskos, J. M.; Decicco, C. P. Both 5-Arylidene-2-thioxodihydropyrimidine-4,6(1H,5H)-diones and 3-Thioxo-2,3-dihydro-1H-imidazo[1,5-a]indol-1-ones Are Light-Dependent Tumor Necrosis Factor-α Antagonists. Bioorg. Med. Chem. Lett. 2003, 13 (3), 533–538. https://doi.org/10.1016/S0960-894X(02)00941-1.
  8. Thakkar, A.; Chadimová, V.; Bjerrum, E. J.; Engkvist, O.; Reymond, J.-L. Retrosynthetic Accessibility Score (RAscore) – Rapid Machine Learned Synthesizability Classification from AI Driven Retrosynthetic Planning. Chem. Sci. 2021, 12, 3339–3349. https://doi.org/10.1039/D0SC05401A.  
