A measure of chemical similarity is only useful if it implies similarity in some relevant property space. Typically, similarity calculations operate by assigning each molecule a chemical fingerprint: a fixed-length vector of bits where the on-bits signify the presence of a certain feature. Common fingerprinting schemes, such as extended-connectivity fingerprints, are by definition general and fail to capture much of the domain-specific theory that underpins similarity in a specific domain. In this work, a hierarchical fingerprinting scheme is developed that is bespoke to a database of ∼4500 organic molecules and their cognate optical absorption spectral properties. Our fingerprinting scheme incorporates molecular fragmentation and domain-specific chemical intuition into an algorithm that categorizes each fragment as being one of a core chemical group, a substituent, or a bridge. The algorithm is applied to every molecule in the database to generate a pool of chemically relevant fragments that are labeled according to their structural category. The fingerprint of each molecule is then composed of a nested Python dictionary specifying the unique identifiers of its constituent fragment entities and the structural links between them to give a hierarchical molecular encoding scheme. Four case studies show the application of our fingerprinting scheme to the subject database. In each case, the clustered molecules display a host of interesting chemical trends. The application that was used to develop and implement this bespoke fingerprinting scheme, referred to as ChemCluster, also exposes a host of other cheminformatics tools pertaining to this database, a selection of which is demonstrated in this work. The enhanced similarity comparisons afforded by our fingerprinting scheme, as well as the large repository of categorized fragments generated during its development, constitute the first step toward using this database in a data-driven materials discovery workflow.
INTRODUCTION
One of most attractive features of organic molecules, particularly in optical applications, is the ease with which their absorption properties can be tuned, e.g., by the introduction of heteroatoms or the addition of functional groups.1 A database consisting of 4500 optically absorbing organic molecule absorption properties was recently generated by Beard et al. using a combination of natural language processing methods and computational simulations.2 Such a repository will contain many examples of clusters of molecules, which show trends in some property space, while differing only by subtle structural or topological differences. If there was a way to classify such clusters, then this database would become very useful for a range of data-science and machine-learning applications that seek to capture chemical information and trends in order to tune these optically absorbing materials for functional applications.
Such clusters may be classified using chemical similarity metrics. Molecular fingerprints are often used to determine the similarity of two molecules in some space by casting each molecule into a unique, computer-readable format that encodes a subset of features, structural or otherwise. Common fingerprinting schemes, such as Extended-Connectivity Fingerprints (ECFPs), are structure-based, but in the domain of optical absorption spectroscopy for organic molecules, structural comparisons alone cannot provide sufficient accuracy to permit inferences in the spectral space.
Furthermore, although existing fingerprinting schemes are reasonably well suited for clustering molecules based on small structural differences, they are often incapable of capturing subtle chemical changes when the accompanying structural changes are significant. For example, the well-known class of organic dyes classified by the D–π–A structural motif consists of an electron donor (D) and acceptor (A) group, connected by a π-conjugated bridge. Typical groups that act as linker units in the π-conjugated bridge are thiophene and phenoxazine, as well as simple methine chains. The typical role of these π-conjugated bridges, in the context of absorption applications, is to mediate charge transfer across the dye; the structure, however, can differ greatly. Similarly, it is common to find a collection of organic dyes consisting of a core group, such as azo or coumarin, with a variety of different chemical substituents whose purpose is to shift absorption properties in some small but deterministic way. A generic fingerprinting scheme has no concept of “hierarchy” and gives equal importance to both the core group and the substituents when encoding molecular features. Therefore, any comparisons of similarity using these fingerprints are likely to underestimate the true similarity of the molecules in such clusters.
This work realizes a hierarchical fingerprint scheme by categorizing fragments according to their chemical “role” in the molecule, that is, whether they are a core group, a substituent, or a bridge segment. Our fingerprinting scheme better caters for the domain-specific chemistry that underpins the true similarity of these optically absorbing molecules. Using this scheme, a number of clustering procedures are developed and applied to the aforementioned database to discover a range of molecular clusters, which show some of their cognate property trends that are inherent in the database. In addition, a large collection of relevant fragments is generated for use in future fragment-based molecule-discovery workflows. This is the first step toward the ultimate objective of using this database in a data-driven materials discovery workflow.
DATABASE OVERVIEW
Before describing any developments, it is instructive to give some background information about the database that is subjected to analysis in this study. It was mentioned in the Introduction that the database contains a mixture of experimental and computational data. The experimental data, which comprise wavelength and optical absorbance information about organic molecules, were obtained by extracting UV/Vis absorption spectral data of molecules from thousands of research papers from 33 journals. It is important to note that no “new” molecules were generated while populating the database, i.e., the database includes only known molecules for which data have been published; thus, they all presumably have known synthetic routes. The molecules collected from the online papers were passed through an opto-electronic screen. This yielded 4591 successful candidates, which were subjected to simplified Tamm–Dancoff-Approach (sTDA) calculations to obtain their excited-state properties. A number of the more promising candidates (∼200) were then subjected to full Time-Dependent Density-Functional-Theory (TD-DFT) analysis. In this work, it is the sTDA calculations that are of interest because of the more heuristic nature of the study: the trends in the data are more important than the exact values themselves; therefore, the fact that more molecules have associated computational data in the case of sTDA makes it a more attractive option.
The sTDA calculations produced a number of property predictions pertaining to the excited state of a molecule, including excitation energies and oscillator strengths, which are of particular interest to this work. For each excitation, the dominant highest-occupied molecular orbital (HOMO) and lowest-unoccupied molecular orbital (LUMO) interactions (e.g., HOMO → LUMO, HOMO-1 → LUMO) are also provided among other properties, such as polarity and charge distribution. In the work of Beard et al., the wavelength of maximum absorption for a molecule, λmax, is defined as being the strongest of the first three excitations, where strength is measured using oscillator strength, f. A “bird’s eye” view of the database, in terms of the two aforementioned key metrics, is shown in Fig. 1, which represents a hexagonal plot of the distribution of (λmax, f) among the optically absorbing molecules in the database. Remarking on this distribution, there are two important observations. First, the majority of molecular calculations afford a peak absorption between 250 and 300 nm, which is just outside the visible region, and hence, one would assume that this is not an ideal result for applications such as absorbing dyes in solar-cells. However, it is important to note that all the calculations are in the gas phase, which results in a systematic lowering of the wavelengths (increase in energies) with respect to the solution phase (where they are most commonly found in application), as is demonstrated in the original paper.2 Second, there are a significant number of molecules that have an oscillator strength greater than 1, which, technically speaking, is unphysical. This is the price that is paid for the reduction in the computational level afforded by sTDA calculations, which may be attributed to the limited basis set. While this plot demonstrates that the exact values of oscillator strength are not reliable, the values are still useful to show general trends in the data and one can assume that a molecule with an oscillator strength greater than 1 is still strongly absorbing at that wavelength.
Hexagonal plot of the (λmax, f) values for each molecule in the database that has realistic sTDA computational data. The λmax value is chosen as the strongest of the first three transitions, where strength is determined using the oscillator strength.
Hexagonal plot of the (λmax, f) values for each molecule in the database that has realistic sTDA computational data. The λmax value is chosen as the strongest of the first three transitions, where strength is determined using the oscillator strength.
In terms of descriptors and cheminformatics, basic theory suggests that conjugation and aromatic rings are present to some degree in nearly all of these absorbing organic molecules. Additionally, many of these molecules contain heteroatoms, such as nitrogen, oxygen, and sulfur in the conjugated cycles. The distributions of each of these chemical metrics are shown in Fig. 2. An analysis of the types of molecules in the database might provide a good indication as to the specific domains of research that are represented in the database, e.g., synthetic dyes vs dyes used in solar-cell research (e.g., D–π–A-type dyes). For this reason, we compiled a list that contains some of the most prevalent chemical groups found in optically absorbing organic dyes across multiple domains; a SMARTS4 (SMILES Arbitrary Target Specification, where SMILES is a Simplified Molecular-Input-Line-Entry System) pattern matching routine was used to test for the frequency of occurrence of each group (Table I).
Distribution plots for four of the most relevant properties pertaining to molecules in the database: (a) the number of conjugated bonds in the longest conjugated region, (b) the number of aromatic rings in the molecule (smallest set of smallest rings), (c) the number of aromatic rings containing at least one heteroatom, and (d) the number of heavy (non-hydrogen) atoms in the largest BRICS3 fragment.
Distribution plots for four of the most relevant properties pertaining to molecules in the database: (a) the number of conjugated bonds in the longest conjugated region, (b) the number of aromatic rings in the molecule (smallest set of smallest rings), (c) the number of aromatic rings containing at least one heteroatom, and (d) the number of heavy (non-hydrogen) atoms in the largest BRICS3 fragment.
A list of groups that are commonly found in optically absorbing organic molecules, alongside the number of molecules in the database containing each group. The presence or absence of a group in a molecule is determined using SMARTS pattern matching.
. | Group . | Occurrence . | . |
---|---|---|---|
Thiophene | 255 | ||
Azo | 263 | ||
Anthraquinone | 58 | ||
Coumarin | 171 | ||
Benzothiazole | 116 | ||
Cyanine | 18 | ||
Fluoresceins | 5 | ||
Triarylmethane | 66 | ||
Squaraine | 5 | ||
Indigoid | 2 | ||
Cyanoacrylate | 110 | ||
Phenoxazine | 7 | ||
Carbazole | 185 | ||
Indole | 104 | ||
Triarylamine | 157 |
. | Group . | Occurrence . | . |
---|---|---|---|
Thiophene | 255 | ||
Azo | 263 | ||
Anthraquinone | 58 | ||
Coumarin | 171 | ||
Benzothiazole | 116 | ||
Cyanine | 18 | ||
Fluoresceins | 5 | ||
Triarylmethane | 66 | ||
Squaraine | 5 | ||
Indigoid | 2 | ||
Cyanoacrylate | 110 | ||
Phenoxazine | 7 | ||
Carbazole | 185 | ||
Indole | 104 | ||
Triarylamine | 157 |
The small number of molecules containing well-known classes of dyes, such as cyanine and indigoid, suggests that the dataset is not representative of all research in the domain of synthetic dyes; the similarly sparse presence of fluoresceins and squaraines suggests that dyes with fluorescence or near-infrared (NIR)-based applications are not in abundance either. In contrast, the large number of molecules with thiophene, triarylamine, and cyanoacrylate suggests that the recent literature on dyes is dominated by molecules that pertain to research in the field of dye-sensitized solar-cells (DSSCs). The carbazole group is also commonly used as a donor group in D–π–A molecules, so its high occurrence is not surprising. Of the six most commonly occurring units, it would be useful to obtain an estimate for the distribution of λmax values within each group. If a group was to have a narrow distribution, it might indicate that either (a) this group is the main chromophoric entity of the molecule or (b) this group was the focus of study in experiments whereby substituents/functional groups are swapped, added, and removed in order to tune properties. The nature of these distributions for each of these dye groupings is represented in Fig. 3.
Violin plots representing λmax distributions for the molecules that contain one of a selection of prominent groups. White dots indicate a median. Boxes show interquartile ranges. Upper and lower whiskers show extremes. Any data beyond whiskers are deemed outliers.
Violin plots representing λmax distributions for the molecules that contain one of a selection of prominent groups. White dots indicate a median. Boxes show interquartile ranges. Upper and lower whiskers show extremes. Any data beyond whiskers are deemed outliers.
It is interesting to note that while both the azo and the thiophene groups are equally prevalent in the database, their distributions differ greatly. The distribution of molecules with azo groups is quite narrow, indicating, as expected, that the azo group is the main absorbing moiety in the molecule. In contrast, the thiophene groups show a broad distribution and this is in keeping with the typical usage of thiophene groups as units in building a π-conjugated bridge rather than being core entities. Similarly, the coumarin group has a narrow distribution, which is in keeping with its being a core constituent, whereas triarylmethane has a broad distribution, presumably because it is typically used as a donor in molecules with a D–π–A motif; as such, it is only one part of the collective molecular framework.
A list of groups that occur most frequently in combination.
No. . | Groups . | Occurrence . |
---|---|---|
1 | (Azo, benzothiazole) | 55 |
2 | (Cyanoacrylate, triarylamine) | 55 |
3 | (Thiophene, cyanoacrylate) | 54 |
4 | (Thiophene, triarylamine) | 45 |
5 | (Thiophene, carbazole) | 32 |
6 | (Thiophene, azo) | 22 |
No. . | Groups . | Occurrence . |
---|---|---|
1 | (Azo, benzothiazole) | 55 |
2 | (Cyanoacrylate, triarylamine) | 55 |
3 | (Thiophene, cyanoacrylate) | 54 |
4 | (Thiophene, triarylamine) | 45 |
5 | (Thiophene, carbazole) | 32 |
6 | (Thiophene, azo) | 22 |
Finally, as a sanity check on the quality of molecules in the database, it is worth calculating which pairs of groups occur in combination most frequently to validate that the resulting patterns agree with our expectations from established domain theory (Table II). It is encouraging to note the presence of entries 2, 3, 4, and 5 as these all correspond to groups in typical D–π–A-type molecules; this supports previous evidence that suggests that the recent literature on dyes is dominated by papers pertaining to solar-cell and similar research.
DEVELOPMENT OF A HIERARCHICAL FINGERPRINTING SCHEME
A chemical fingerprint is a fixed-length vector of bits where the on-bits signify the presence of a certain feature. In this way, chemical information is transformed into a numerical code that is suitable for computational applications.5 The most common class of structure-based fingerprints is the ECFPs, the first implementation of which was the Morgan fingerprint (MF),6 while by far the most common similarity metric is the Tanimoto similarity index,7 although other metrics, such as Dice similarity,8 are not uncommon.9,10 Other common fingerprinting schemes exist, such as path-based fingerprints, which are not unlike circular fingerprints but have distinct differences. Examples of path-based fingerprints are the ones presented by Daylight Chemical Information Systems,11 as well as the Rational Discovery Kit (RDKit) bespoke fingerprint scheme (an extension of the Daylight algorithm).12 It is instructive to also mention structural keys, such as the Molecular ACCess System (MACCS) keys, which, having originally been intended for substructure searching of databases, have in recent years come to be used as a genuine fingerprint alternative.13 Unfortunately, at least in the field of organic spectroscopy, basic chemical intuition would suggest that structural similarity alone cannot always account for the observed spectral similarities between molecules. Maggiora et al. gave an excellent review of this distinction between the chemical, structural, and molecular similarity.9
One of the motivations behind this work is to develop an easy interface with which one can explore a large repository of optically active molecules, thus enabling the automated clustering of similar molecules according to user-defined, tailored queries. In this way, trends between chemically similar molecules are revealed and the database is readily exposed to more advanced data mining. Crucial to achieving this task is the development of a bespoke fingerprinting scheme, which is specific to the chemistry that dominates the molecules in this database.
Step 1: Fragments and groups
The fingerprinting scheme presented in this work aims to break a molecule down into its constituent core groups, substituents, and linker bridges. Furthermore, the scheme encodes the various substituents according to which groups they are attached to and, in the case of more than one instance of the same group, to which instance. This allows for more detailed queries to be specified, should it be deemed necessary by the user. The overarching mechanism for isolating the constituent components of a molecule is shown schematically in Fig. 4. The first step involves decomposing a molecule into its Breaking of Retrosynthetically Interesting Chemical Substructure (BRICS) leaf fragments.3 BRICS was specifically chosen for this first step because it only cleaves bonds with known synthetic routes, so it naturally preserves the majority of core groups as one contiguous, conjugated block. Once the fragments have been generated, the list of previously generated fragments is indexed to see if this fragment already exists in another molecule. If it does not, a new “fragment” object is created. When distinguishing fragments, all details of the fragment, including the position of the link atoms on rings and the category of the link atom, are considered. This is done should the database ever wish to be used for fragment-based molecule discovery: when linking two fragments, the compatibility of their link atoms (as defined in the BRICS paper3) is essential.
Schematic demonstrating the workflow that produces a list of 1027 core groups from the 4591 molecules in the original database. The quantity of entities at each stage of the workflow is shown at the bottom.
Schematic demonstrating the workflow that produces a list of 1027 core groups from the 4591 molecules in the original database. The quantity of entities at each stage of the workflow is shown at the bottom.
Once every molecule has been decomposed, to produce a collection of leaf fragments, the next step is to screen the fragments for potential core groups by applying chemical intuition. This involves checking that a fragment meets both of the following criteria: (a) it contains at least one conjugated carbon atom, where conjugation is defined as having hybridization other than sp3 and (b) it contains at least one atom that is a member of a ring. The latter condition is similar to the Murcko definition of core moieties,14 but the former condition ensures that non-conjugated rings do not classify; this is an important distinction for optically absorbing organic molecules. As an additional step in the screen, each fragment is checked to ensure that it has at least one parent molecule for which there is associated computational data. To see why this filtering process is necessary, one only needs to look at the top eight most common fragments in Fig. 5. Only the arene-based fragments constitute a core group. This screening process generates a list of “clean” fragments, which are advanced to the next stage: the grouping phase. Therein, link atoms are removed and stereochemistry is disregarded. Additionally, direct substituents are removed. The term “direct” here implies single or two atom substituents, such as halogens and alcohols. Other, longer substituents should already have been removed by the BRICS fragmentation algorithm, which, in the majority of cases, cleaves the bond between a core group (a conjugated ring structure) and its substituents, provided that the substituent is longer than a couple of atoms. The reason for not cleaving the bond in the case of small substituents is that the BRICS algorithm prevents the creation of redundant or inconsequential fragments, such as single –OH groups or halogens. Similarly, long alkyl chains are not cleaved because a long alkyl chain is considered to be a trivial fragment by the algorithm. As a result, long alkyl chain substituents need to be shrunk as part of this step. It is worth noting that resonance structures are not differentiated during the generation of core groups.
The eight most common fragments in the database subsequent to applying the BRICS fragmentation algorithm. The occurrence represents how many molecules in the database have that fragment. With the exception of benzene, the other six fragments do not constituent a core group under any definition and must be filtered out. The numbers connected to each fragment represent the BRICS chemical environment of that fragment, as per the definitions in the original BRICS paper.3
The eight most common fragments in the database subsequent to applying the BRICS fragmentation algorithm. The occurrence represents how many molecules in the database have that fragment. With the exception of benzene, the other six fragments do not constituent a core group under any definition and must be filtered out. The numbers connected to each fragment represent the BRICS chemical environment of that fragment, as per the definitions in the original BRICS paper.3
This begs the following question: How is something identified as being a direct substituent, as opposed to being something like the carbonyl oxygen in coumarin dyes, which is inherently part of the core group, and should, therefore, not be removed? A list of common direct substituents was compiled by searching the fragment list for isolated benzene fragments (i.e., fragments with only one ring structure—benzene) and then removing the benzene structure to leave a molecule consisting of only detached substituents. In this way, since isolated benzene is by far the most common fragment, it is expected that this list would be exhaustive. Once the list of substituents had been compiled, each fragment group was searched for the presence of one or more of these substituents using SMARTS. An additional step still needed to be applied to prevent groups such as the carbonyl oxygen described above from being stripped. If the first atom in the substituent is not attached to the ring via a single bond, then it is not removed as it is most likely part of the conjugated framework; otherwise, it is removed. Having applied the above procedures, the fragments were reduced to their “core” forms. Multiple different fragments can reduce to the same core form and, as a point of reference, our process yielded 3379 fragments but only 1027 groups overall.
Step 2: Substituents and bridges
In step 1, it was described how the core groups in each molecule were identified. This still left the task of identifying the substituents and the bridges. Of course, everything that is not a group is either a substituent or a bridge, so the task was to distinguish between the two. The substituents were identified as follows:
For every group in the molecule, the atoms of which that group is composed were determined using the GetSubstructMatches function in RDKit15 to produce a “match” (a tuple of atom indices). If more than one instance of a group occurred, then there were multiple separate matches.
The atom subsets that constitute a match were removed one by one from the molecule with replacement. Every time a match was removed, the resulting “reduced” molecule was composed of multiple non-connected segments, as many as there were branches off of that particular group instance. In SMILES (simplified molecular-input-line-entry system) notation, these segments are all separated by a literal “.”.
Each of these segments was then checked for the presence of any other group match (including other instances of the same group). If the segment contained a different match, then it could not possibly be a terminal segment and it had to be a bridge; if it did not contain a different match, then it was considered to be a substituent.
This process was repeated for every match. It is worth noting that sometimes a group is a subset of a larger group (e.g., benzene and naphthalene); when this was the case, the group match was skipped because it would otherwise remove itself from the larger group as well, causing complications. The substituents were instead determined via the larger group.
The entities that were produced during the processes described in steps 1 and 2, along with the relative size of each entity repository, are shown in Fig. 6.
Schematic showing the entities produced during the hierarchical fingerprinting algorithm that separates each molecule into its constituent core groups, substituents, and bridges. In order to determine the core groups, the molecules first need to be fragmented; in order to determine the substituents and bridges, the core groups need to be known. The relative size of each entity repository is reflected here.
Schematic showing the entities produced during the hierarchical fingerprinting algorithm that separates each molecule into its constituent core groups, substituents, and bridges. In order to determine the core groups, the molecules first need to be fragmented; in order to determine the substituents and bridges, the core groups need to be known. The relative size of each entity repository is reflected here.
Step 3: Fingerprint encoding
Once a repository of clean fragments, groups, substituents, and bridges has been generated, the next step is to assign to each molecule a fingerprint that encodes this information. Then, queries can be performed on the database to extract molecules that match a certain combination of any of the aforementioned components. Traditional schemes use bit-vectors to encode molecule fingerprints. In our case, the fingerprint was encoded as a dictionary with the keys representing the entity type (group, substituent, or bridge) and the values being an array of tuples where each tuple captures (component_id, quantity). For substituents, the instance of the group, to which the substituents are attached, was also stored to allow for more advanced querying. An example of the fingerprint encoding scheme is shown in Fig. 7. The molecule shown in this example has the internal dictionary representation: {“groups”: (11, 1)_(2, 2), “subs”: (1, [(11, 0]), “bridges”: (0, 2)}. This encoding is interpreted as a molecule with one instance of group identifier (ID) 11, two instances of group ID 2, one instance of substituent ID 1 (attached to the zeroth occurrence of group ID 11 in the molecule), and two instances of bridge ID 0. The IDs assigned to each entity are arbitrary but consistent, in that each run of the ChemCluster fingerprinting algorithm will generate the same entities with the same IDs. The ChemCluster application provides convenience functions that display an image of a particular entity by passing in its ID. This method of encoding fingerprints was selected over the typical bit-vector due to the hierarchical nature of our fingerprints: when determining similarity, groups are more important than substituents and bridges, and this information is difficult to capture using the mechanics of bit-vectors. Additionally, the dictionary scheme offers a convenient interface for constructing queries to select a subset of molecules. An example of clustering molecules according to this new scheme is shown in Fig. 8, where molecules with the same group fingerprint (including counts) are grouped together. In order to help clarify its visualization, a minimum of 15 molecules per cluster was required, and the clusters have been split across four plots with six clusters per plot. The distinguishing features of the molecules in a cluster are their substituents and bridges. The clusters are quite compact, particularly in (c) and (d), but there are outliers present in most cases. The clusters are more compact in terms of their wavelengths compared with their oscillator strengths, but it is known that the oscillator strengths are not reliable, so this is not a surprising result.
An example of the group, bridge, and substituent entities generated by applying the fingerprinting algorithm described in this work to a randomly selected molecule from the database. The internal dictionary representation of the fingerprint assigned to this molecule is as follows: {“groups”: (11, 1)_(2, 2), “subs”: (1, [(11, 0]), “bridges”: (0, 2)}, which means one instance of group ID 11, two instances of group ID 2, one instance of substituent ID 1 (attached to the zeroth occurrence of group ID 11 in the molecule), and two instances of bridge ID 0.
An example of the group, bridge, and substituent entities generated by applying the fingerprinting algorithm described in this work to a randomly selected molecule from the database. The internal dictionary representation of the fingerprint assigned to this molecule is as follows: {“groups”: (11, 1)_(2, 2), “subs”: (1, [(11, 0]), “bridges”: (0, 2)}, which means one instance of group ID 11, two instances of group ID 2, one instance of substituent ID 1 (attached to the zeroth occurrence of group ID 11 in the molecule), and two instances of bridge ID 0.
Using the grouping algorithm, molecules are clustered into subsets having the same groups (including counts). The legend in each plot displays the group segment of the fingerprint associated with each cluster. There are 24 clusters having 15 or more molecules. The (λmax, f) distributions of these 24 clusters are shown here spread over four plots for clarity. The oscillator strengths show more variation within a cluster, likely due to the more unreliable nature of these values compared to the wavelengths. The clusters in (a) and (b) are not quite as compact as those in (c) and (d). The outliers in a cluster are of interest to discover the substituent or bridge effect that is causing the large shift.
Using the grouping algorithm, molecules are clustered into subsets having the same groups (including counts). The legend in each plot displays the group segment of the fingerprint associated with each cluster. There are 24 clusters having 15 or more molecules. The (λmax, f) distributions of these 24 clusters are shown here spread over four plots for clarity. The oscillator strengths show more variation within a cluster, likely due to the more unreliable nature of these values compared to the wavelengths. The clusters in (a) and (b) are not quite as compact as those in (c) and (d). The outliers in a cluster are of interest to discover the substituent or bridge effect that is causing the large shift.
It is an interesting exercise to analyze the molecules of a particular cluster that have outliers to try and understand the reason why these outliers occur. For example, looking at the plot in Fig. 8(b), the gray cluster contains molecules with the fingerprint {“groups”: (2, 1)_(391, 1)} (which translates as: one instance of group ID two and one instance of group ID 391), and there are a number of outliers with very small oscillator strengths. A sample of this cluster is shown in Fig. 9. One can chemically intuit that the reason for the two low oscillator strengths (marked in red) is the carbonyl bridge connecting the two main groups. The carbon in this bridge is electron deficient, which decreases the extent of conjugation across the groups and results in a much weaker transition.
A sample of the molecules in the gray cluster in Fig. 8(b). This particular cluster has outliers in terms of oscillator strength, and it is demonstrated here that the electron-deficient carbonyl bridge is the cause.
A sample of the molecules in the gray cluster in Fig. 8(b). This particular cluster has outliers in terms of oscillator strength, and it is demonstrated here that the electron-deficient carbonyl bridge is the cause.
RESULTS
Our hierarchical fingerprinting scheme described above permits the discovery of chemical trends inherent in the database of ∼4500 optically absorbing organic molecules generated by Beard et al. Using our scheme, we were able to cluster the molecules in this database according to a range of interesting chemical similarity queries. The results of four such queries are presented herein. Specifically, we analyze clusters whereby the molecules in each similarity subset (1) share the same core group, (2) are the same up to topological considerations, and (3) are the same up to their π-conjugated bridge segments. As an additional case study, we study the impact of the cyanide group on the optical absorption properties of optically active organic dyes.
Molecules with the same core groups
The first query is rather basic and seeks to cluster molecules based on their group fingerprint, including counts, i.e., classify all molecules that contain not only the same groups but also the same quantity of each group. Substituents and bridges are ignored. Running the query produces a large number of subsets; not all of them are entirely insightful. It would therefore be beneficial to screen for those subsets where there is a large difference between the maximum/minimum wavelength and/or oscillator strength of the molecules in the subset. Where the groups, and the counts of the groups, are identical, it can be supposed that the difference is due to one or a number of substituent and bridge effects. A series of subsets resulting from this query is shown in Figs. 10 and 11.
(a)–(h) A subset of molecules in the database that have the same group fingerprint (with counts). This subset was discovered by applying a filter over all subsets to reveal those displaying a large difference in computational values within the subset, indicating the presence of a dominant substituent or bridge effect. The lack of conjugation in the bridges of molecules (c), (d), (e), and (h) causes a very low oscillator strength relative to other molecules, highlighting a fundamental principle of optical absorption spectroscopy on organic molecules.
(a)–(h) A subset of molecules in the database that have the same group fingerprint (with counts). This subset was discovered by applying a filter over all subsets to reveal those displaying a large difference in computational values within the subset, indicating the presence of a dominant substituent or bridge effect. The lack of conjugation in the bridges of molecules (c), (d), (e), and (h) causes a very low oscillator strength relative to other molecules, highlighting a fundamental principle of optical absorption spectroscopy on organic molecules.
A second example of the subsets arising from clustering by group fingerprint. In this case, two separate clusters are shown [(a) and (b)]. In (a), the only change is the substituent on the leftmost benzene ring. The nitro-group is the outlier in this case, and the almost non-existent oscillator strength can be explained by the strong electron withdrawing character of the nitro-group, which decreases electron density in the conjugated framework. In (b), as the length of the methine bridge increases from left to right, the wavelength and oscillator strength also increase as expected.
A second example of the subsets arising from clustering by group fingerprint. In this case, two separate clusters are shown [(a) and (b)]. In (a), the only change is the substituent on the leftmost benzene ring. The nitro-group is the outlier in this case, and the almost non-existent oscillator strength can be explained by the strong electron withdrawing character of the nitro-group, which decreases electron density in the conjugated framework. In (b), as the length of the methine bridge increases from left to right, the wavelength and oscillator strength also increase as expected.
Analysis of both of these subsets offers a perfect demonstration of one of the most fundamental concepts in optical absorption spectroscopy on organic molecules: that extent of conjugation is one of the key considerations in producing a strongly absorbing dye and for tuning the wavelength of that absorption. The fact that putting a single non-conjugated bond in between two heavily conjugated groups causes such a dramatic shift in the oscillator strength (see molecules with very low f values marked in red, within Fig. 10) is completely in keeping with theory and emphasizes the usefulness of our hierarchical classified interface for the database from an educational perspective. Additionally, the inset in Fig. 11(a) shows the influence that a change in the substituent can have on the f value, with the nitro-group pulling electron density from the framework and decreasing the excitation strength. Meanwhile, Fig. 11(b) displays the effect of progressively increasing the length of a methine chain on the λmax and f values.
Another interesting finding is that which contains the molecules shown in Fig. 12. This demonstrates an important example of the advantage of our tailored fingerprinting scheme for two reasons. First, it highlights the benefits of the BRICS fragmentation as the first step to produce grouped molecules because, as previously mentioned, the BRICS algorithm only cleaves bonds with a known synthetic route; as it happens, this coincides nicely with natural separation points between chromophores within an optically absorbing organic molecule. With this being considered, neither the azo bond itself nor the bond between nitrogen and the phenyl group was cleaved by the BRICS algorithm, thus preserving the core of the azo dye as one group. If another method for automatically deciphering groups had been used, such as Murcko decomposition, the nitrogen double bond would have been deemed to be a link between two groups rather than be classified as a single, contiguous entity. Second, while SMARTS matching is easily capable of extracting all molecules from the database with an azo segment, this is as far as SMARTS matching can go in terms of clustering unless additional information is introduced, in which case the “automated” element of the application and the exploratory advantages that come with it are sacrificed. There are multiple files containing different azo groupings that are presented in the supplementary material. The azo clusters were discovered more easily by applying a SMARTS filter to the query result (this filter is included as an option to the query function for convenience of use). This is explained in more detail in the README of the source repository that is provided as a link to this paper.
A subset of azo clusters demonstrating the importance of using the BRICS algorithm for the initial fragmentation as opposed to other alternatives, such as Murcko decomposition.
A subset of azo clusters demonstrating the importance of using the BRICS algorithm for the initial fragmentation as opposed to other alternatives, such as Murcko decomposition.
Molecules with topological differences
Our hierarchical fingerprinting scheme is well suited for discovering clusters of molecules where one or a combination of groups, substituents, and bridges differs between molecules. Meanwhile, it can also be used to discover subsets of molecules where the molecules differ by their topology only, e.g., ortho-, meta-, and para-substituents or a variation in the connectivity of groups. Querying for topological subsets entails searching for molecules with identical fingerprints as per the encoding scheme described earlier in step 3 of our method development. Since each molecule in the database possesses a unique canonical SMILES string (and is therefore a unique compound), molecules with identical fingerprints must differ by connectivity only.
Running this query over the database results in 199 unique subsets, some more insightful than others. It would be convenient to screen for those subsets where either one or both of the absorption wavelength and the oscillator strength vary significantly between molecules. Applying this filter with Δλmax > 75 nm and Δf > 0.6 reduces the number of subsets to just 18, where Δλmax > 75 nm means that there is a difference of at least 75 nm in the λmax values between any two molecules in the cluster, and similarly for Δf. An example of four such subsets is shown in Fig. 13. The examples in Figs. 13(a), 13(c), and 13(d) can be described as para vs meta, para vs ortho, and para vs meta, respectively. It is known that in conjugated frameworks, when the substituents of an aromatic ring are meta to one another, the even number of bonds between substituents precludes an arene ring from forming a quinoidal resonance structure, thereby forcing the ring to adopt its aromatic form exclusively, which limits the ability of such a molecule to exhibit intramolecular charge transfer.16 This observation explains the trends of higher oscillator strengths for the para-isomers in Figs. 13(a) and 13(d), but does not account for the increased wavelength for the meta-isomer in Fig. 13(d). For additional clarity, it was decided to check if both excitations of these meta-isomers are of the same order. As it turns out, the molecule on the left of Fig. 13(d) has a strong first excitation at 317.9 nm and negligible second and third excitations, while the molecule on the right has a strong third excitation at 294.1 nm and negligible first and second excitations. The reason for both molecules having different excitation orders is likely due to their differing extents of intramolecular charge transfer, causing a preference for different molecular orbital contributions. The situation in Fig. 13(b) is interesting due to the very large difference in absorption wavelengths between the two molecules. As before, the excitation orders are different and the wavelengths of the molecules shown on the left and right in Fig. 13 refer to the third and first excitations, respectively. A suggested explanation for why these molecules favor different excitations is the presence of endo–exo isomerism, although without the molecular orbital plots this is not conclusive.
(a)–(d) Topology case study: The molecules in these four subsets have the same hierarchical fingerprint, i.e., their groups, substituents, and bridges are all identical. The only difference is their topology: the connection points of substituents and bridges to the groups.
(a)–(d) Topology case study: The molecules in these four subsets have the same hierarchical fingerprint, i.e., their groups, substituents, and bridges are all identical. The only difference is their topology: the connection points of substituents and bridges to the groups.
Molecules with different bridge segments: Thiophene units
A common class of organic dyes is one with a donor–π–acceptor (D–π–A) architecture. Such dyes are particularly prevalent in DSSCs. Upon absorption, these dyes undergo intramolecular charge transfer (ICT) from the donor to the acceptor, wherefrom electron density is then transferred into the semiconductor to which it is anchored (typically TiO2). An important feature of most DSSCs is that the absorbing dye is anchored via the acceptor group, meaning that the donor region is far from the semiconductor surface, making electron recombination less probable and increasing efficiency overall. The mechanism for generating the excited state in D–π–A molecules is beyond the scope of this work, but an essential component is the presence of a π-conjugated bridge, which acts to mediate the charge transfer across the molecule. As the name suggests, the π-conjugated bridge is heavily conjugated and an interesting point of study is the effect that both the length of the bridge and the motifs of which it is composed have the excited-state properties; this is typically examined by inspecting the absorption properties of a chromophore, namely, λmax and the extinction coefficient, ϵ (or its computational analog, oscillator strength, f).
Therefore, a useful query on our database is to find those molecules that contain the same acceptor and donor entities but a different number or type of π-linker units in order to study how enhanced conjugation alters the absorption properties of a dye. Using our new fingerprinting scheme, this can easily be achieved by searching the subject database for molecules that have the same group fingerprint (including counts) except for the exact counts of one group. This query can be performed generally, to find all such clusters of molecules in the database, or a specific group can be provided to the query via its group ID. We take the case of a thiophene group, which commonly resides in the π-linkers of a DSSC dye, as we wish to study its overall effect on the π-bridge. The query works by first grouping molecules that have the same general group ID (i.e., ignoring counts), and then within these subsets, each molecule is tested against every other molecule to identify cases where only the quantities of one of the groups are different: the remainder being identical.
The query returned 21 molecule subsets. An analysis of these output files highlights some interesting trends that are worthy of review. A subset of some of the more insightful trends is shown in Figs. 14 and 15.
In keeping with the discussion of D–π–A-type molecules, it is instructive to inspect the files for which one of the constant groups is a triarylamine unit, as this commonly acts as the donor in DSSC dye molecules. One such example is shown in Fig. 14(a) where the cyanoacrylate acceptor, a very common acceptor unit, is also present. The number of thiophene units in the π-bridge increases by one when viewed from left to right in Fig. 14(a) and the corresponding increase in both λmax and f can be observed. Similarly, in Fig. 14(b), the triarylamine donor entity has now been swapped for another common donor group, but the trend is otherwise the same. This study highlights that while the exact values returned from sTDA calculations have to be treated with caution, the trends in the values are very useful for explaining the underlying chemistry. The advantage of organic molecules in terms of their ready “tunability” is also demonstrated by this case study: synthetically, adding extra thiophene units is not challenging, but it affords predictable and systematic shifts in absorption properties.
Thiophene case study No. 1: the only difference between each of the molecules is the number of thiophene units in the π-bridge. The corresponding increase in absorption properties in both (a) and (b) is explained by the increased extents of conjugation that results from the presence of an additional thiophene unit. Note that the depiction of the thiophene bridge in these examples is not in keeping with traditional representation, where the thiophene orientation alternates and the chain is linear. The RDKit 2D representations of molecules for drawing purposes are more concerned with optimization of structure for visual clarity, so the representations here are not necessarily reflective of scientific considerations.
Thiophene case study No. 1: the only difference between each of the molecules is the number of thiophene units in the π-bridge. The corresponding increase in absorption properties in both (a) and (b) is explained by the increased extents of conjugation that results from the presence of an additional thiophene unit. Note that the depiction of the thiophene bridge in these examples is not in keeping with traditional representation, where the thiophene orientation alternates and the chain is linear. The RDKit 2D representations of molecules for drawing purposes are more concerned with optimization of structure for visual clarity, so the representations here are not necessarily reflective of scientific considerations.
(a)–(h) Thiophene case study No. 2: In this sample grouping, the effect of different substituents on both the thiophene units and the benzene rings is evident. Different substituents either increase or decrease electron density in the benzene ring, causing the benzene-containing moiety to tend toward acting as a donor or an acceptor, respectively.
(a)–(h) Thiophene case study No. 2: In this sample grouping, the effect of different substituents on both the thiophene units and the benzene rings is evident. Different substituents either increase or decrease electron density in the benzene ring, causing the benzene-containing moiety to tend toward acting as a donor or an acceptor, respectively.
Another set of interesting trends is shown in Fig. 15. While only one out of the eight molecules here differs in its number of thiophene units, the more interesting point of study that is highlighted by this particular grouping is the effect that different substituents have on both the benzene rings and the thiophene units in the π-conjugated bridge. Depending on the substituent, the benzene-containing moiety acts as either a donor or an acceptor (e.g., amine vs cyano) and electron density is either withdrawn from the π-conjugated bridge (cyano) or not (no substituent).
Effect of cyanide substituent
Possibly the most common example of trends in the database occurs when a subset of molecules differ only by their substituents. In fact, in many cases, it is not unreasonable to assume that the study of these subsets was one of the objectives of this paper from which they were scraped. The application presented here exposes an interface for accessing these subsets for a particular substituent, specified using its SMARTS pattern. Taking the cyanide substituent as an example, there is a query to discover all subsets of molecules where the only difference between molecules is the presence or absence of a cyanide substituent. An example of four such subsets is shown in Fig. 16. The same trend is observed across all samples: the presence or absence of a cyanide group causes an increase in wavelength and absorption strength. Intuitively, one might expect the exact opposite to occur. Since cyanide is electron withdrawing, it should “pull” electron density from the conjugated framework. However, the electron withdrawing effect of cyanide is not the only contribution that is made to the overall molecular framework; the cyanide group also serves to increase the extent of conjugation. From the examples presented in Fig. 16, it would appear that the increased extent of conjugation owing to the presence of the cyanide ion is the more dominant contribution.
Case study: substituent effects. In each of the four examples of molecular pairs shown here (a)–(d), the only difference between the two molecules is the presence or absence of a cyanide group. Across all four examples, the same trend is observed: the presence of a cyanide group increases the wavelength of the absorption and its strength.
Case study: substituent effects. In each of the four examples of molecular pairs shown here (a)–(d), the only difference between the two molecules is the presence or absence of a cyanide group. Across all four examples, the same trend is observed: the presence of a cyanide group increases the wavelength of the absorption and its strength.
It would be possible to develop this case study further and quantitatively express, on average, how much the addition of a cyanide substituent shifts a computational value, such as λmax, relative to another (or no) substituent. However, the scope of this present work is limited to demonstrating the capability of the application to discover a wide variety of subsets and, as such, a visual demonstration of the results suffices. Furthermore, it was mentioned previously how the sTDA computational values, particularly the oscillator strengths, are only reliable up to trends. Thus, one must proceed with caution if trying to compute a quantitative value to reflect these heuristically presented trends.
CONCLUDING REMARKS
In this work, a database of 4591 optically absorbing organic molecules with associated computationally generated absorption properties has been exposed to a variety of clustering procedures through the development of a bespoke hierarchical fingerprinting scheme, ChemCluster. This is a scheme that uses a combination of molecular fragmentation and chemical intuition to decompose molecules into their constituent group, substituent, and bridge components. This is the first attempt to apply the tools of cheminformatics and extract chemical insights from this database, the largest open-source molecular repository of its kind.2 Through the use of the clustering algorithms developed in this work, a range of molecular subsets has been generated that highlights chemical and structural trends, which are inherent in the data. These trends provide insights into the theory that underpins much of optical absorption spectroscopy on organic molecules. Future work will seek to apply the techniques presented here to larger, more general, molecular datasets to build computational models that are central for applications such as data-driven materials discovery.
SUPPLEMENTARY MATERIAL
See the supplementary material for azo_clusters-1.png (Fig. S1), azo_clusters-2.png (Fig. S2), and azo_clusters-3.png (Fig. S3). The molecules in each of these files all share the same azo-containing core fragment, and the only difference between each molecule within a particular file is the number and/or type of the substituents. The λmax and oscillator strength of the dominant excitation as per the sTDA calculations are shown below each molecule.
ACKNOWLEDGMENTS
J.M.C. acknowledges the BASF/Royal Academy of Engineering Research Chair in Data-Driven Molecular Engineering of Functional Materials. J.M.C. also acknowledges the Science and Technology Facilities Council (STFC) via the ISIS Neutron and Muon Source who partly support this Research Chair.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
All source code for the bespoke fingerprinting scheme and its clustering algorithms is available at https://github.com/PjFlan/chemcluster. This code was also used to generate all plots and images in this paper.