Very few pharmaceutical molecules with more than 5 aromatic rings are observed in any of the data sets, and all 3 pharmaceutical sets fall behind the general organic molecules at this point.
A similar trend is observed in both industrial sets; however, it is shifted toward higher numbers of rings, peaking at 2 for Pfizer and 3 for AZ. For drugs in the CSD, however, we see a peak in structures with one ring, and then a sharp decline after 3. It would also appear from this result that molecules currently of interest to the pharmaceutical industry may be larger than drugs have been historically, and this perhaps reflects a change in the range of medical conditions modern pharmaceuticals are developed to treat.

For organic molecules in the CSD, we observe a shallow decline in the frequency of structures observed as the number of rings increases, with very few structures observed with 5 or more ( Appendix Fig.

The skewing of the peak range to heavier weights relative to that seen in the Feher data set suggests that more recently approved drugs may be generally of higher molecular weight. The drug subset contains many recently approved drugs, but also a large quantity of drugs from throughout the 20th century. Very few molecules below 100 g/mol are seen in any pharmaceutical data set, whereas ∼13% of the CSD organics fall in this range. However, both industrial data sets trend significantly toward even larger molecules than the CSD drugs, with AZ molecules peaking in the 500-600 g/mol range. The molecules of the industrial data sets also show a significant population in this range, with Pfizer molecules peaking in this band. As the chemistry of drugs evolves over time (alongside the range of conditions they are developed to treat), there is a danger that the use of historic data sets may result in an outdated perspective on the sorts of molecules with the potential to become drugs.

From this analysis (Fig. 2), it can be seen that for the CSD Drug subset, the most common weight range is between 300 and 400 g/mol, which is a deviation toward larger molecules than seen in the Feher data set.
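The molecular weight comparison above amounts to sorting each data set into 100 g/mol bands and reporting the most populated band. The following is a minimal sketch of that bookkeeping, not the workflow used in this study: the function names and the `toy_weights` list are hypothetical and are not taken from the CSD, AZ, or Pfizer data.

```python
from collections import Counter


def weight_band(mw, width=100):
    """Return the (lower, upper) g/mol band that contains molecular weight mw."""
    lower = int(mw // width) * width
    return (lower, lower + width)


def band_fractions(weights, width=100):
    """Fraction of molecules falling in each occupied band, ordered by band."""
    counts = Counter(weight_band(mw, width) for mw in weights)
    total = len(weights)
    return {band: counts[band] / total for band in sorted(counts)}


def peak_band(weights, width=100):
    """Most populated band; raises ValueError if weights is empty."""
    fractions = band_fractions(weights, width)
    return max(fractions, key=fractions.get)


# Hypothetical molecular weights (g/mol), for illustration only.
toy_weights = [151.2, 180.2, 285.3, 310.4, 334.4, 361.4, 399.9, 420.5, 512.6]

if __name__ == "__main__":
    for band, fraction in band_fractions(toy_weights).items():
        print(f"{band[0]}-{band[1]} g/mol: {fraction:.2f}")
    print("peak band:", peak_band(toy_weights))
```

Applying the same functions to each data set separately would give one peak band per set, which is the kind of figure quoted above for the CSD drugs, Pfizer, and AZ.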
Alongside this, we took the opportunity offered by our pharmaceutical collaborators in the ADDoPT project to investigate how relevant this new data set is to the chemical space being explored by the modern pharmaceutical industry. It is hoped that this data will become more freely available if the value in releasing this information is made more apparent. To maximize the potential impact of this data set, and to produce reliable models of the structural properties of pharmaceuticals such as solubility, mechanical behavior, form stability, etc., it is important that we are able to match any experimental data to the correct crystal form. We have, however, discovered that crystal information regarding pharmaceuticals is difficult to obtain, specifically that of the marketed polymorphic form. And perhaps draw new links between chemical and crystal properties.

In addition to this, as part of the Advanced Digital Design of Pharmaceutical Therapeutics collaboration between academia and industry, we have been given the unique opportunity to run comparative analysis on the internal crystal structure databases of AstraZeneca and Pfizer, alongside comparison to the CSD as a whole. We hope that this new resource will lead to improvements in targeted cheminformatics and statistical model building in a pharmaceutical setting. This has resulted in a subset of 8632 crystal structures, representing all published solid forms of 785 unique drug molecules. By making use of InChI matching, a CSD Python API workflow to link CSD entries to the online database Drugbank.ca has been produced. We report the generation and statistical analysis of the CSD drug subset: a subset of the Cambridge Structural Database (CSD) consisting of every published small-molecule crystal structure containing an approved drug molecule.
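The idea behind InChI matching can be sketched in plain Python. This is an illustration only and does not use the CSD Python API or DrugBank: the identifiers `DRUG-A`, `DRUG-B`, and `ENTRY01` to `ENTRY04` are hypothetical placeholders, and the function `link_entries_to_drugs` is an assumed name. The sketch assumes each crystal entry has already been reduced to a list of standard InChI strings, one per molecular component, so that a solvate or co-crystal still matches on its drug component.

```python
from collections import defaultdict

# Standard InChI strings for two well-known drugs and for water.
ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
PARACETAMOL = "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical drug records: placeholder drug ID -> InChI.
DRUG_INCHIS = {
    "DRUG-A": ASPIRIN,
    "DRUG-B": PARACETAMOL,
}

# Hypothetical crystal entries: placeholder entry ID -> component InChIs.
CRYSTAL_ENTRIES = {
    "ENTRY01": [ASPIRIN],             # single-component form
    "ENTRY02": [ASPIRIN],             # a second form of the same drug
    "ENTRY03": [PARACETAMOL, WATER],  # hydrate: drug plus solvent
    "ENTRY04": [WATER],               # contains no drug molecule
}


def link_entries_to_drugs(entries, drug_inchis):
    """Map each drug ID to the sorted entry IDs containing that drug."""
    inchi_to_drug = {inchi: drug_id for drug_id, inchi in drug_inchis.items()}
    matches = defaultdict(set)
    for entry_id, components in entries.items():
        for inchi in components:
            drug_id = inchi_to_drug.get(inchi)
            if drug_id is not None:
                matches[drug_id].add(entry_id)
    return {drug_id: sorted(ids) for drug_id, ids in matches.items()}


subset = link_entries_to_drugs(CRYSTAL_ENTRIES, DRUG_INCHIS)

if __name__ == "__main__":
    for drug_id, entry_ids in sorted(subset.items()):
        print(drug_id, entry_ids)
```

Exact string comparison is the simplest possible matching rule. A production workflow would also have to decide how to treat salts, charged species, and stereochemistry layers, none of which is handled here.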