Preprint/Chemical Graph Theory

PLOS Topic Pages
PLOS Computational Biology • PLOS Genetics • PLOS ONE

This is a PLOS Topic Page draft

Public peer review comments are posted here
All content on this page is being developed under a CC BY 4.0 license


Authors
About the Authors 

Mehmet Aziz Yirik
AFFILIATION: Institute for Analytical Chemistry, Friedrich Schiller University Jena , Lessingstraße 8, 07743, Jena, Germany
https://orcid.org/0000-0001-7520-7215

Kumsal Ecem Colpan
AFFILIATION: Institute for Plant Genetics, Heinrich Heine University Düsseldorf , Universitätsstraße 1, 40225, Düsseldorf, Germany
https://orcid.org/0000-0002-8689-218X

Saskia Schmidt
AFFILIATION: Institute for Analytical Chemistry, Friedrich Schiller University Jena , Lessingstraße 8, 07743, Jena, Germany
https://orcid.org/0000-0002-4802-228X

Maria Sorokina
AFFILIATION: Institute for Analytical Chemistry, Friedrich Schiller University Jena , Lessingstraße 8, 07743, Jena, Germany
https://orcid.org/0000-0001-9359-7149

Christoph Steinbeck
AFFILIATION: Institute for Analytical Chemistry, Friedrich Schiller University Jena , Lessingstraße 8, 07743, Jena, Germany
https://orcid.org/0000-0001-6966-0814


Abstract

edit

The chemical graph theory is a subfield of mathematical chemistry which applies classic graph theory to chemical entities and phenomena. In chemical graph theory, molecular structures are represented as chemical graphs. In such chemical graph, nodes and edges represent atoms and bonds. Chemical graphs are main data structures to represent chemical structures in cheminformatics. Computable properties of graphs lay the foundation for (quantitative) structure activity and structure property predictions - a core discipline of cheminformatics. These graphs can then be reduced to graph-theoretical descriptors or indices, which reflect the physical properties of molecules. One of the most famous examples of a graph-based molecular descriptor is the Wiener index, which corresponds to the sum of the lengths of all shortest paths in a molecule and correlates with its boiling points. Besides chemical indices, applications of graph theory in chemistry include many other topics such as isomer enumeration, molecular substructures searching in chemical databasesmolecular structure generation.

Graph Theory Background

edit
 
Graph representation of dopamine molecule. (A) Molecular structure of dopamine. (B) Graph representation of the molecule.

In discrete mathematics, a graph is the illustration of the relationship between a set of objects represented by vertices. The relationship between two objects is shown by an edge. Graphs are studied by graph theory and the term was first used by Slyvester for the definition of chemical structures. In chemistry, molecules are commonly represented as graphs, where vertices and edges respectively represent atoms and bonds. Vertex degrees and edge multiplicities correspond to atom valences and bond multiplicities. The distance between two vertices is the number of edges in the shortest path. In a graph, a path means a walk in which a vertex is only visited once. In a connected graph, between every vertex pair, there must be at least 1 path. Mostly, molecular graphs are connected graphs. In a graph, vertices are adjacent if they are joined by an edge. Graph adjacency is mostly represented by adjacency matrices. In 1874, Sylvester represented an organic molecule connectivity with an adjacency matrix. It is a square matrix whose dimension is   for a molecule with n atoms.


 


By providing the bond orders between atoms, the adjacency matrix can be turned to a connectivity matrix. Molecular connectivity is also represented by connectivity matrices. Different graphs might be topologically equal. In classification of graphs, their equivalence classes comprise isomorphic graphs. Two graphs are called isomorphic if there is an edge-preserving bijection mapping a pair of adjacent vertices to a pair of adjacent vertices in the other graph. Isomers of a molecule are the set of isomorphic molecular graphs. Graph isomorphism and connectivity are key criteria in chemical graph theory, especially in chemical graph generation. The term, subgraph, is equivalent to molecular substructures in chemical graph theory. In the field, the key substructures are cycles. From graph theory, the cyclomatic number means the number of cycles in a graph. The formulation of the cyclomatic number is:


 


In this equation,   represents the number of edges and   represents the number of vertices in a graph. In addition, several mathematicians developed different formulations to determine the number of cycles in both organic and inorganic substances. One of the extended versions of the cyclomatic number formula was developed by the mathematician Oliver Lodge.[1] He used the valence concept to obtain closed chemical formulas for different organic and inorganic compounds, and he invented the formulation below:


 


With this formulation, where   represent the atom valences, he calculated the number of cycles in different covalent molecules.

History

edit

The relationship between graph theory and chemistry has a long way from past to present. Different studies related to both disciplines have built very strong interactions in between and formed the scientific area known as chemical graph theory [2]. The first mentions of chemical graphs were in the late eighteenth century, where the perspective of chemistry was also affected by the ideas of Isaac Newton [3]. Although the studies about the interactions between the atoms gained speed during that century, the chemical bonds were not identified. Thus, the first usage of chemical graphs was for representing the hypothetical forces between the molecules and atoms [4].

In 1805 John Dalton had built the first atomic model by representing atom types with specific circles, which could show only the chemical positions and number of atoms in a molecule [5]. However, August Kékule showed both physical positions and orientations of atoms in a molecule. In his “Tetrahedral Carbon Atom” model (Figure 2), he classified several organic molecules and visualized the bond orders between atoms, including the benzene ring [6].

 
Tetrahedral carbon atom model.

This work led the three-dimensional thinking in chemical modeling to start and the tetrahedral carbon atom model affected not only organic but also inorganic chemical compounds’ structure-modeling [7]. Following Kékule’s work, Alfred Werner developed the new scientific field, coordination chemistry by inventing the idea that the atoms have specific natural properties related to their location in a molecule [8]. In addition, complex compounds were represented with octahedral models.

In 1861, Alexander Butlerov introduced the term “molecular structure” meaning that every chemical substance should have a fixed structure in molecular bases. This term explained several chemical properties of the substances[9]. To develop this concept, illustrations, analysis and formulations of different chemical compounds were proposed in years by scientists such as Johann Wolfgang Döbereiner, Alexander William Williamson and Archibald Scott Coupe [10]. As another representation model, the line representation of bonds between atom pairs was first used in William Higgins' chemical structure models [11]. However, these lines were representing only the interatomic forces, not the specific bonds, therefore Couper's work has been counted as the first graphical edge representation of a chemical bond [12]. The molecular formula of acetic acid was defined and depicted as a chemical structure with straight lines between the atoms in a molecule to represent a chemical bond (Figure 3).

 
Acetic acid chemical and structural representation.

The next breakthrough idea leading to the field is the one of fixed valence bonds of different atom types, and it was first published in the book of Edward Frankland: “Lecture Notes for Chemical Students” in 1866. In this book, several atom types were introduced, such as hydrogen, zinc, boron and carbon with their respective valence bonds [13]. Following these studies, several other chemistry-based graph-theoretical analyses were made. Arthur Cayley and James Joseph Sylvester constructed chemical graphs with respect to the structural formulations of chemical substances [14] [15]. Cayley developed the tree representation of alkanes, kenograms, for structural isomer enumeration of alkanes (Figure 4). In addition, Sylvester labeled vertices with different letters to instruct chemical graphs with a variety of properties.

 
Alkane kenogram (2,3,3-trimethylpentane).

With all these fundamental theories being formulated, the need for pure mathematical analysis and applications in chemistry became explicit, in particular, for the mathematical analysis and discovery of unknown molecules. The evolution of chemical graph theory accelerated especially during the 20th century with the discovery and the synthesis of new molecules [16]. For the structure elucidation of unknown molecules, one of the possible preliminary steps used to be the identification of all isomers for a given or predicted molecular formula. Isomer enumeration became popular in the 1930s for the determination of the possible number of chemical structures for molecules. Besides an apparent simplicity of the task that is the isomer enumeration, the generation of all possible isomers became another key step towards structure elucidation. When formulated as a chemical graph theory problem, the structure generation of all possible molecular structures from a given molecular formula is a combinatorial generation of all possible graphs for given node degrees, i.e. the atom valences. In addition to molecular formulas as input, the inclusion of the substructure information during the generation process shrinks the search space of the chemical graph generation. In chemical graph theory, the earliest generator, CONGEN, came from a Stanford team in the 60s. Following that,many other structure generators have been developed since then.

On the other hand, chemical graph theory is important for the metabolomics which is the study of the set of metabolites present within a given sample. The analyses of metabolomes are important for the discovery of new metabolites, the elucidation of disease processes in living organisms, and many other applications. Several methods for metabolome analyses are currently available, in particular mass spectrometry (MS) and nuclear magnetic resonance (NMR). The exact mass of a molecule almost always corresponds to one molecular formula. During MS analyses, in vitro fragmentation (i.e. breaking down the molecule in substructures), allows the obtention of a series of exact masses that correspond to a given set of substructures. Many CASE suites require such spectral information as input. The information provided in the spectral data, e.g. the present molecular substructures, helps the elucidation of new structures through the assembly of substructures in a single structure in a similar way one assembles pieces of a puzzle. In terms of the chemical graph theory, such information builds the substructures of the searched chemical structures. Therefore, the efficiency of the metabolomics data has a major impact on the accuracy of CASE. The key applications of chemical graph theory for CASE are: isomer enumeration, chemical graph generation, molecular fragmentation and molecular descriptors.

Applications

edit

Isomer Enumeration

edit
 
Three isomers of dichlorobenzene. (A) Molecular structure of 1,2-Dichlorobenzene. (B) Molecular structure of 1,3-Dichlorobenzene. (C) Molecular structure of 1,4-Dichlorobenzene.

The first application of graph-theoretical techniques for solving a chemical problem was in the field of isomer enumeration. In 1811, Louis Joseph Gay-Lussac started a discussion of compounds with the same atom and bond sets but different arrangements. However, the definition of isomerism for chemical compounds was defined firstly as "possessing the same chemical constitution and molecular weight but differing properties" by Jöns Jakob Berzelius, in 1830 [17]. In other words, isomers consist of the same atom and bond sets with different arrangements. In 1862, Butlerov obtained different isomers of methane molecules substituted by chloride molecules [18], which followed by a redefinition of the term “structural isomerism”, which was then explained as having the same chemical formula but different connections between the atoms of substances [19]. Isomers are classified into two groups: constitutional and steric. Constitutional isomers (Figure 5) provide only atom neighbourhood information. However, the configurational information, such as bond angle, bond length, and distance between disconnected atoms, are considered in stereoisomerism (Figure 6).

 
Stereoisomers of C2H2Br2.

Pasteur was the first scientist to introduce the term stereoisomerism, consisting of the same number and type of chemical bonds but in different orientations [20][21]. Chemical graphs are suitable for enumeration of structural isomeric compounds because of their ability to show only the topological positions of the atoms in a molecule. However, they are not convenient for stereoisomeric substances, because of their inability to give information about physical positions of atoms in molecules [22].

The first contribution to the isomer enumeration problem came from a mathematician, James Joseph Slyvester. In his study, the enumeration of rooted trees was discussed. In chemistry, rooted trees correspond to alkyl radicals with i - non-hydrogen atoms (Figure 7).

 
Alkyl radicals.

This enumeration method relied on the polynomial representation of molecular connectivity. In other words, the coefficients of the polynomials represented the number of bonds between atoms. In 1874, Arthur Cayley extended the problem to the enumeration of unrooted trees. Alkanes are chemical examples of these unrooted trees. Cayley’s contributions increased the interest in the problem. Following this study, Hugo Schiff, Hermann, Ferdinand Tiemann and many others worked on the enumeration of alkanes. However, including Cayley, none of them found the general formula for alkane isomer enumeration. In 1937, the general formula for isomer enumeration was found by George Polya [23]. Polya’s enumeration theory was a general counting method with implementation into a variety of fields. The method basically relied on symmetry operations. From mathematics, symmetry groups and their actions were used for the polynomial representations. For the calculation of polynomial coefficients, cycle indices of each symmetry were calculated. The calculation of the cycle indices was based on conjugacy classes of the acting symmetry group [24]. The construction of Polya’s polynomial was step by step described in the mathematical chemistry and cheminformatics book [25]. Another good example came from Pevac [26]. In that study, Polya’s enumeration method was implemented for enumeration of boat and chair cyclohexane isomers. For the molecule, first, its symmetry group was calculated. Then polynomial representation was constructed based on cycle indices. In the literature, isomer enumeration studies were mostly for special compound classes such as alkanes, aromatic hydrocarbons and polycyclic aromatic compounds [27][28][29].

In CASE, the chemical space size of virtual compound libraries are detected by enumeration methods, however, the construction of the isomer sets are performed by the molecular structure generators.

Chemical Graph Generation

edit

Molecular structure generation is a branch of graph generation problem. The earliest molecular structures were modified versions of graph generators. Structure generators generate computer representations of chemical structures adhering to certain boundary conditions. These generators are mostly BFS or BFS based combinatorial algorithms requiring these basic inputs: molecular formula and substructures. To elucidate the structure of an unknown molecule, all combinatorially possible molecular extensions should be taken into account. These algorithms extend intermediate structures in a recursive manner until the molecular saturation; however, the extension of intermediate structures causes a combinatorial explosion. Thus, many structure generators have been designed in line with mathematical theorems. Group theory, especially permutation group theory, has been applied to accelerate the calculation of bond extension. Compared to bond by bond extension, group theoretical methods complete the extension process in one go. CONGEN was the earliest structure generator, a part of the first CASE system, DENDRAL. [30] CONGEN, first, built a tree based structure. Each node of the tree represented a substructure of the unknown molecule. These substructures were extended based on group theoretical lemmas. Besides group theory, many other mathematical theorems have been applied in the field. MASS, a tool for mathematical synthesis and analysis of molecular structures was a matrix based structure generator. This method was considered as an adjacency matrix generation algorithm.[31] In the literature, structure generators are classified into two groups: structure assembly and structure reduction methods.

Assembly methods start the generation with a set of atoms from a molecular formula. Atoms are combinatorially assembled until the saturation. The earliest assembly method was ASSEMBLE from Munk and Shelley.[32] This generator was a part of CASE system, called very trivially, "CASE".[33] ASSEMBLE was not able to deal with substructural overlaps. Contrary to ASSEMBLE, GENOA was a constructive substructure search-based assembly method; which well dealt with substructural overlaps.[34] In the 80s, a set of CASE papers, CHEMICS, was contributed to the field by Japanese scientists.[35] The vector representation of components and their usage in the generation process were the novelties of the study. All component sets were ranked from primary to tertiary used in the extension process.

In assembly methods, molecular extensions usually end up with a combinatorial explosion. To cope with this problem, orderly structure generators have been developed. MOLGEN is a well known orderly structure generator.[36] As descendants of DENDRAL and MASS methods, MOLGEN also generates structures first as connectivity matrices with respect to chemical constraints; then stores in an output file. In the matrix generation, rows (or columns) are built in descending order. MOLGEN is an efficient but commercial structure generator. Besides MOLGEN, another commercial structure generator is from one of the MASS developers, Michael Elyashberg, ACD Labs. The structure generator is a part of the commercial CASE system, StrucEluc.[37]

Unlike assembly methods, reduction methods construct a hypergraph with all possible bonds among atoms. First, the existence of substructures are checked in the hypergraph, then the irrelevant bonds are removed. If a substructure is not in the hypergraph anymore, it is removed from chemical constraints. In some assembly methods, substructural overlaps were not taken into account; however, all structure reduction methods dealt well with structural overlaps due to the hypergraph structure. COCOA was the earliest reduction method from Munk and Shelley[38]; later integrated into the CASE system, called SESAMI.[39] In COCOA algorithm, substructures were represented as atom centered fragments, the list of atoms’ first neighbours. As the improved version of COCOA, HOUDINI was also released.[40] Although structural overlaps were taken into account, the massive size of hypergraphs was the main disadvantage of reduction approaches. Bohanec combined two methods to overcome the disadvantages of both method types; assembly and reduction. The algorithm, GEN[41], avoided irrelevant bonds and assembled atoms based on substructural information.

In the field, MOLGEN was the fastest and an efficient structure generator for decades. As an alternative to MOLGEN, MAYGEN [42] was developed. It is an open-source chemical graph generator and approximately 3 times slower than MOLGEN. Following MAYGEN, the same team developed another open-source generator, SURGE [43] which is the fastest chemical graph generator in the field.

Molecular Fragmentation

edit
 
Four fragments of ZINC42650.

In metabolomics, one major challenge is the identification of unknown molecules. For this purpose, Liquid chromatography (LC), mass spectrometry(MS), tandem mass spectrometry(MS/MS) and Nuclear Magnetic Resonance are commonly used techniques.[44]

Matching the experimentally obtained spectra against spectral libraries is a basic approach for such identification, also called dereplication. However, searching in the reference libraries also has limitations, such as the library sizes and reliability of the spectral data [45]. Thus, successful structure identification through dereplication needs a massive increase in the number of accurate spectral libraries [46], however, the searched molecular formula can also be a new compound. Therefore, in most of the cases, the search in compound libraries does not yield any results. Besides these libraries, search in comprehensive molecular structure databases has also been used in structure identification. To overcome the limitations of spectral libraries and molecular databases, retrieval of structural information from molecular fragments has been the widely-used method in the field. To achieve this, a variety of in silico fragmentation methods have been developed in the last 20 years, in particular rule-based methods and combinatorial fragmenters. The rule-based methods rely on self-determined rules, developed by experimental fragmentation patterns which provide particular structural properties [47]. These methods grant consistent, fast and accurate results. The DENDRAL project recorded the first usage of fragmentation mass spectra and fragmentation rules. Nowadays, the state-of-the-art rule-based fragmenters are Mass Frontier, ACD/MS and MOLGEN/MS. The main disadvantage of rule-based methods is the need for expert-curated rules [48]. Therefore, the usage of the combinatorial fragmentation approaches has been relatively increased [49]. Different from the rule-based methods, combinatorial algorithms generate fragments in an expert-free way, based on the cleavage of the chemical bonds in a given molecule. The cleaving bonds are first scored, then the bond disconnection is performed based on these penalty scores. Many combinatorial algorithms, such as FiD (Fragment iDentificator) and MetFrag, vary based on their scoring functions [50][51]. FiD is a software to identify the structure of the resulting ions of the MS/MS spectrometry application to organic compounds with low molecular weight [52]. It is advantageous for the analysis of compounds, for which the mechanisms of fragmentation are not perfectly known. In FiD, there are two different methods for fragmentation such as the single-step model and the multi-step model. The single-step model does not consider the intermediate fragments and with the multi-step model, the analysis for complex fragmentation pathways is possible. Another combinatorial fragmenter, MAGMA, arranges substructures of the input compound into their best form, explaining the fragmentation model of MSn spectra tree [53]. In the different levels of MSn data, there is a variety of fragment peaks and the algorithm constructs the hierarchical tree representation of these substructures in the MSn data. In chemical graph theory, understanding the generation of subgraphs of chemical structures is as crucial as the generation of molecular structures from a set of bonds and atoms. The coherent fragmentation accelerates the identification of the unknown molecule.

Molecular Descriptors

edit

Molecular descriptors are defined by Todeschini and Consonni as:


“The molecular descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.”[54][55]


Molecular descriptors are, in a broad sense, a way to describe and quantify a chemical structure with mathematical and cheminformatics tools. It needs to be clear that there is no molecular descriptor that fits all applications and that the same molecule can be meaningfully analyzed and described with different descriptors depending on the question to be answered and aims to be reached. Various types of molecular descriptors exist, and many involve chemical graph theory in their definition [56]. Among those, there are chemical indices, topological indices, autocorrelation descriptors, geometrical descriptors and some of the molecular fingerprints. Most of them reveal themselves useful for CASE: to measure the topology and geometry between the query and target molecules, quickly identify identical features between a big number of chemical graphs or enable fast filtering of chemical libraries based on required molecular features.

Topological indices
edit

Topological indices are two-dimensional molecular descriptors based on the topology of the molecular structure when represented as a graph. The molecular graph is the first topological index, as it is the 2D graph representation of a molecule. Therefore, the graph G=(V,E) represents the molecular structure, the set of vertices V represents the set of atoms where each vertex is an atom and the set of edges E represents the bonds between the atoms. The molecular graph is an undirected weighted sparse multigraph. The usage of a graph to depict a molecular structure allows applying well-known graph theory algorithms to it, in order to extract meaningful topological information. These molecular graphs are commonly represented as adjacency matrices or as collections of adjacency lists, where two atoms are adjacent if there is a chemical bond connecting them.

However, adjacency matrices are only one of the possible matrices that can be calculated from a molecular graph, and that can contain different information. These matrices, also called graph-theoretical matrices can be molecular descriptors by themselves or starting points for other molecular descriptors. There are three main types of graph-theoretical matrices: vertex matrices, edge matrices and incidence matrices. Vertex matrices are square matrices where rows and columns refer to graph vertices that are the atoms of the molecule, and each element of the matrix contains information related to the pair of atoms. Edge matrices are also square matrices where rows and columns refer to graph edges that are the chemical bonds of the molecule, and each element of the matrix contains information related to the pair of bonds. Incidence matrices are generally not square and contain information about different types of objects in their rows and columns, such as atoms versus edges, or even bigger molecular elements, such as cycles, molecular fragments or paths.

The adjacency matrix, which is a vertex matrix, allows calculating the Lovasz–Pelikan index [57], which is the largest eigenvalue of the matrix. The other most-used vertex matrix is the topological distance matrix, where each entry of the matrix represents the number of edges in the shortest path between the vertices i and j in the molecular graph. The most known and used topological index in chemistry, based on the topological distance matrix, is the Wiener index. This index is defined by summing the length of all shortest paths between non-hydrogen atom pairs. For a molecule, the half-sum of its distance matrix returns its Wiener index. Many other vertex, edge and incidence matrices-based molecular descriptors have been described across the years by numerous researchers and have then been compiled by Todeschini and Consonni [58].

Using a matrix representation of molecular graphs allows applying a multitude of matrix operators and manipulators enabling calculation of numerous sets of molecular descriptors that highlight diverse information about the molecules.

Chemical Indices
edit

The first usage of chemical indices came from the studies of Calingaert and Hladky [59]. In the study, the proportion of molecular volume and the number of carbons in a hydrocarbon was described as a chemical index for the properties of hydrocarbons. The similar index was used also in the study of Kopp [60], summing the different atom types to describe the volume and densities of molecules. Besides the usage of atom numbers as indices, Wiener introduced one of the earliest graph invariants for the correlation between molecular properties and structural features. In his study, the structural index was used for the determination of paraffin’ boiling point [61]. In 1971, Hosoya introduced a new index: the Hosoya index, which is the number of edge matchings in a graph and formulated as:


 


In this formula, P(G,i) is the number of matchings of i-mutually non-adjacent edges in the graph; in other words, i-covering of chemical graphs. The index has been used in a variety of applications such as the modelling of physicochemical properties of hydrocarbon atoms [62]. The usage of the index was described in Hosoya’s review [63]. In 1947, bonds were differentiated by valences of their end vertices with the study of Hartmann [64]. This study was extended by Randic’s topological index in 1975 [65]. It was also called connectivity index since the formula was based on bond weights. In the formula, given below, d(i) and d(j) represent atom valences for vertices i and j. Bond weight is calculated with the formula [d(i).d(j)]-½ . Thus, the Randic index is calculated by summing all bond weights :


 


This index is not a reliable descriptor for the characterization of molecules since non-isomorphic structures might have the same Randic index [66]. Later, Balaban contributed to the field by slightly modifying the Randic Index [67]. Different than the Randic index, the average of the bond weight sum was taken in the Balaban index. The index formula is :


 


The average value is calculated by multiplying the Randic index by [M/(m+1)]. M is the number of edges and m is the number of cycles in the graph.

In the literature, there are more than 100 different chemical indices, often described as topological indices, which they are generally similar to.

Other theoretical molecular descriptors
edit

Geometrical descriptors are based on three-dimensional molecular structures, where the position of the atoms in the 3D space is known and the connections between them are defined. Molecular 3D structures can be experimentally elucidated from crystallographic or NMR data or computed using molecular optimization algorithms. Geometrical descriptors have a higher information content than those based on 2D structures, such as topological descriptors and chemical indices, but have to be treated with the awareness that their values heavily depend on the molecule conformation, and can vary depending on the latter. Two of the most known classes of three-dimensional descriptors are the WHIM (Weighted Holistic Invariant Molecular) descriptors [68] and the GETAWAY (Geometry, Topology, and AtomWeights Assembly) descriptors [69].

Molecular fingerprints are structural keys that are a well-defined bit list of molecular features present or absent in a molecule. These features can be 2D and/or 3D structural properties and have been very useful for molecular searches, in particular the similarity search, due to their computational effectiveness. They are particularly suitable for Tanimoto and Tversky similarity indices. The molecular fingerprints are determined by fragmenting the molecule in all possible substructures following a set of rules. MACCS and PubChem fingerprints are the most widely used nowadays. The MACCS fingerprint has been developed by the MDL Information Systems (now BIOVIA) and the public version contains 166 pre-computed molecular features that might be present or absent in a molecule. The PubChem fingerprint has been developed to enable efficient and fast similarity search in the PubChem database and is composed of 881 pre-computed structural features. An alternative to the structural keys is hashed fingerprints. This type of fingerprint does not require a pre-computed set of molecular features, but rather breaks the molecular structure on a set of all possible substructures following a set of pre-defined rules. The path-centred Morgan fingerprints and the atom-centred circular fingerprints are the most-used hashed fingerprints, and they are generally used to find structural patterns and substructural similarities in molecules.

Conclusion

edit

Chemical graph theory is a branch of mathematics combining graph theory and chemistry. The field was ensued due to the necessity of mathematical modelling. Molecular structures are represented as graphs and their mathematical analysis is performed based on the graph theorems. One of the key implementations of chemical graphs is the structure elucidation. For the identification of molecular structures, chemists need to build the structures with respect to its spectral data. The methods for the elucidation process are part of chemical graph theory. The key chemical graph theory methods were presented for CASE. Starting from the spectral data of an unknown molecular structure, first the atom, bond and subgraph information are retrieved, then the enumeration as well as the generation of its isomers can be performed. There are further analyses to determine the best candidate structures, such as molecular descriptor based comparison. Although there are recently many effective contributions in the field, there is always the necessity of further enhancements. One of the modern directions of science is artificial intelligence, and the chemical graph theorems will have more impact on the improvements of AI methods such as machine learning, and deep neural networks.

Toolkits

edit

All cheminformatics toolkits can also be considered as chemical graph theory software. Besides cheminformatics toolkits, the general graph theory libraries are also listed. These graph libraries are used for different purposes, such as subgraph search and graph isomorphism. The list of notable software is given below:

Name Home Page
Accord SDK http://accelrys.com/products/datasheets/accord-software-development-kit.pdf
ACD/ StructEluc https://www.acdlabs.com/products/com_iden/elucidation/struc_eluc/
ADMET Predictor http://www.simulations-plus.com
ASSEMBLE http://www.upstream.ch/main.html?src=%2Findex.html
CACTVS http://www.xemistry.com/academic
CDD Vault https://www.collaborativedrug.com/cdd-vault
CDK https://cdk.github.io
ChemDoodlEAPI https://www.collaborativedrug.com/cdd-vault
chemf https://github.com/stefan-hoeck/chemf
ChemmineR http://manuals.bioinformatics.ucr.edu/home/chemminer
COCON http://cocon.nmr.de
Cytoscape https://cytoscape.org
DENDRAL http://www.softwarepreservation.org/projects/AI/DENDRAL/DENDRAL-CONGEN_GENOA.zip/view
Enalos KNIME nodes http://tech.knime.org/community/enalos-nodes
Enalos+ KNIME nodes http://enalosplus.novamechanics.com/
frowns http://frowns.sourceforge.net/
Hellium https://web.archive.org/web/20140407082845/http://www.moldb.net/helium.html
iGgraph https://igraph.org/
Indigo http://lifescience.opensource.epam.com/indigo
JGraphT https://jgrapht.org/
LSD http://eos.univ-reims.fr/LSD/index_ENG.html
Marvin, JChem http://www.chemaxon.com
MAYGEN https://maygenerator.github.io
MedChem Designer http://www.simulations-plus.com
MedChem Studio http://www.simulations-plus.com
Molecular Operating Environment (MOE) https://web.archive.org/web/20160909172415/http://www.chemcomp.com/MOE-Cheminformatics_and_QSAR.html
MolecularGraph.jl https://github.com/mojaie/MolecularGraph.jl
MolEngine https://www.scilligence.com/web/scilligence-regmol/
MOLGEN http://www.molgen.de/
MolSig http://molsig.sourceforge.net
NAUTY http://pallini.di.uniroma1.it
OEChem http://eyesopen.com/
OMG https://sourceforge.net/p/openmg/code/ci/master/tree/src/org/omg/
OpenBabel http://openbabel.org/
OUCH http://www.pharmash.com/posts/2010-08-02-ouch.html
PerlMol https://web.archive.org/web/20120315121757/http://www.perlmol.org/
PMG https://sourceforge.net/projects/pmgcoordination/
Rcpi https://bioconductor.org/packages/Rcpi
RDKit http://www.rdkit.org/
SENECA https://github.com/steinbeck/seneca
SMOG http://ccl.net/cca/software/MS-DOS/SMOG/
SMSD http://www.ebi.ac.uk/thornton-srv/software/SMSD/
surge https://structuregenerator.github.io

References

edit
  1. Cayley P. LVII. On the mathematical theory of isomers. Lond Edinb Dublin Philos Mag J Sci. 1874;47: 444–447.
  2. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  3. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  4. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  5. Cardwell DSL, Dalton J, others. John Dalton & the progress of science. 1968.
  6. Hein GE. Kekule and the architecture of molecules. Adv Chem Kekule Centen. 1966;61: 1–12.
  7. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  8. Werner A. Chem. 3 (1893) 267.(b) A. Werner. Justus Liebigs Ann Chem. 1912;386.
  9. Butlerov AM. Einiges über die chemische Struktur der K [Page no. 1] per. Z Chem Pharm. 1861;4: 549–60.
  10. Alexanderson, Gerald L. 2006. “About the Cover: Euler and Königsberg’s Bridges: A Historical View.” Bulletin of the American Mathematical Society 43 (04): 567–74. https://doi.org/10.1090/S0273-0979-06-01130-X.
  11. FRS WH. A Comparative View of the Phlogistic and Antiphlogistic Theories.. J. Murray; 1791.
  12. Couper AS. Sur une nouvelle théorie chimique. Annales de chimie et de physique. 1858. pp. 488–489.
  13. Frankland E. Lecture notes for chemical students: Embracing mineral and organic chemistry. J. Van Voorst; 1866.
  14. Cayley P. LVII. On the mathematical theory of isomers. Lond Edinb Dublin Philos Mag J Sci. 1874;47: 444–447.
  15. Sylvester JJ. On an application of the new atomic theory to the graphical representation of the invariants and covariants of binary quantics, with three appendices. Am J Math. 1878;1: 64–104.
  16. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  17. Berzelius J. On the composition of tartaric acid and racemic acid (John’s acid from the Vosges Mountains), on the atomic weight of lead oxide, together with general remarks on those substances which have the same composition but different properties. Poggendorf’s Ann Phys Chem. 1830;19: 305.
  18. Butlerov AM. Ueber die Verwandtschaft der mehraffinen Atome.”. Z Für Chem. 1862;5: 297–304.
  19. Al R, others. Carbohydrate chemistry: Fundamentals and applications. World Scientific Publishing Company; 2018.
  20. Pasteur L. Comptes rendus hebdomadaires de l’Académie des Sciences, Paris. 1848.
  21. Al R, others. Carbohydrate chemistry: Fundamentals and applications. World Scientific Publishing Company; 2018.
  22. Bonchev D. Chemical graph theory: introduction and fundamentals. CRC Press; 1991.
  23. Beeler, Robert A. 2015. “Advanced Counting—Pólya Theory.” In How to Count, edited by Robert A Beeler, 219–55. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-13844-2_8.
  24. Jiping, Zhang. 1992. “On Finite Groups All of Whose Elements of the Same Order Are Conjugate in Their Automorphism Groups.” Journal of Algebra 153 (1): 22–36. https://doi.org/10.1016/0021-8693(92)90146-D.
  25. Kerber, Adalbert, Reinhard Laue, Markus Meringer, Christoph Rücker, and Emma Schymanski. 2013. Mathematical Chemistry and Chemoinformatics. DE GRUYTER. https://doi.org/10.1515/9783110254075.
  26. Pevac, S, and G Crundwell. 2000. “Pólya’s Isomer Enumeration Method: A Unique Exercise in Group Theory and Combinatorial Analysis for Undergraduates.” Journal of Chemical Education 77 (10): 1358. https://doi.org/10.1021/ed077p1358.
  27. Bytautas, L, and D J Klein. 1998. “Chemical Combinatorics for Alkane-Isomer Enumeration and More.” Journal of Chemical Information and Computer Sciences 38 (6): 1063–78. https://doi.org/10.1021/ci980095c.
  28. Dias, Jerry Ray. 1984. “A Periodic Table for Polycyclic Aromatic Hydrocarbons. IV. Isomer Enumeration of Polycyclic Conjugated Hydrocarbons. 2.” Journal of Chemical Information and Computer Sciences 24 (3): 124–35. https://doi.org/10.1021/ci00043a002.
  29. Balasubramanian, Krishnan. 2018. “Combinatorial Enumeration of Isomers of Superaromatic Polysubstituted Cycloarenes and Coronoid Hydrocarbons with Applications to NMR.” The Journal of Physical Chemistry A 122 (41): 8243–57. https://doi.org/10.1021/acs.jpca.8b08784.
  30. Sutherland G. DENDRAL - A computer program for generating and filtering chemical structures. Stanf Artifical Intell. 49: 34.
  31. Serov VV, Elyashberg ME, Gribov LA. Mathematical synthesis and analysis of molecular structures. J Mol Struct. 1976;31: 381–397. doi:10.1016/0022-2860(76)80018-X
  32. Badertscher M, Korytko A, Schulz KP, Madison M, Munk ME, Portmann P, et al. Assemble 2.0: A structure generator. Chemom Intell Lab Syst. 2000;51: 73–79. doi:10.1016/S0169-7439(00)00056-3
  33. Shelley CA, Munk ME. Case, a computer model of the structure elucidation process. Anal Chim Acta. 1981;133: 507–516. doi:10.1016/S0003-2670(01)95416-9
  34. Carhart RE. GENOA: A computer program for structure elucidation utilizing overlapping and alternative substructures. J Org Chem. 1981;46: 1708–1718.
  35. Sasaki SI, Abe H, Hirota Y, Ishida Y, Kudo Y, Ochiai S, et al. CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds. J Chem Inf Comput Sci. 1978;18: 211–222. doi:10.1021/ci60016a007
  36. Gugisch R, Kerber A, Kohnert A, Laue R, Meringer M, Rücker C, et al. MOLGEN 5.0, a Molecular Structure Generator in Advances in Mathematical Chemistry. Adv Math Chem Basak SC Restrepo G Villaveces JL Eds.
  37. Blinov K, Elyashberg M, Molodtsov S, Williams A, Martirosian E. An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations. Fresenius J Anal Chem. 2001;369: 709–714.
  38. Christie BD, Munk ME. Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation. J Chem Inf Comput Sci. 1988;28: 87–93. doi:10.1021/ci00058a009
  39. Madison M, Schulz K, Korytko A, Munk M. SESAMI: An integrated desktop structure elucidation tool. Internet J Chem. 1998;1: CP1–U22.
  40. Korytko A, Schulz K-P, Madison MS, Munk ME. HOUDINI: A New Approach to Computer-Based Structure Generation. J Chem Inf Comput Sci. 2003;43: 1434–1446. doi:10.1021/ci034057r
  41. Bohanec S. Structure Generation by the Combination of Structure Reduction and Structure Assembly. J Chem Inf Comput Sci. 1995;35: 494–503. doi:10.1021/ci00025a017
  42. Yirik MA, Sorokina M, Steinbeck C. MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle. Journal of Cheminformatics. 2021 Dec;13(1):1-4.
  43. McKay BD, Yirik MA, Steinbeck C. Surge-A Fast Open-Source Chemical Graph Generator.
  44. Allard P-M, Péresse T, Bisson J, Gindro K, Marcourt L, Pham VC, et al. Integration of Molecular Networking and In-Silico MS/MS Fragmentation for Natural Products Dereplication. Anal Chem. 2016;88: 3317–3323. doi:10.1021/acs.analchem.5b04804
  45. Djoumbou-Feunang Y, Pon A, Karu N, Zheng J, Li C, Arndt D, et al. CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification. Metabolites. 2019;9: 72. doi:10.3390/metabo9040072
  46. Hufsky F, Böcker S. Mining molecular structure databases: Identification of small molecules based on fragmentation mass spectrometry data. Mass Spectrom Rev. 2017;36: 624–633. doi:10.1002/mas.21489
  47. Allard P-M, Péresse T, Bisson J, Gindro K, Marcourt L, Pham VC, et al. Integration of Molecular Networking and In-Silico MS/MS Fragmentation for Natural Products Dereplication. Anal Chem. 2016;88: 3317–3323. doi:10.1021/acs.analchem.5b04804
  48. Hufsky F, Böcker S. Mining molecular structure databases: Identification of small molecules based on fragmentation mass spectrometry data. Mass Spectrom Rev. 2017;36: 624–633. doi:10.1002/mas.21489
  49. Bohanec S. Structure Generation by the Combination of Structure Reduction and Structure Assembly. J Chem Inf Comput Sci. 1995;35: 494–503. doi:10.1021/ci00025a017
  50. Heinonen M, Rantanen A, Mielikäinen T, Kokkonen J, Kiuru J, Ketola RA, et al. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Commun Mass Spectrom. 2008;22: 3043–3052. doi:10.1002/rcm.3701
  51. Ruttkies C, Schymanski EL, Wolf S, Hollender J, Neumann S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J Cheminformatics. 2016;8: 3. doi:10.1186/s13321-016-0115-9
  52. Heinonen M, Rantanen A, Mielikäinen T, Kokkonen J, Kiuru J, Ketola RA, et al. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Commun Mass Spectrom. 2008;22: 3043–3052. doi:10.1002/rcm.3701
  53. Ridder L, van der Hooft JJJ, Verhoeven S, de Vos RCH, van Schaik R, Vervoort J. Substructure-based annotation of high-resolution multistage MS(n) spectral trees. Rapid Commun Mass Spectrom RCM. 2012;26: 2461–2471. doi:10.1002/rcm.6364
  54. Todeschini R, Consonni V. Handbook of Molecular Descriptors. John Wiley {&} Sons; 2008. Available: https://books.google.com/books?hl=en%7B&%7Dlr=%7B&%7Did=TCuHqbvgMbEC%7B&%7Dpgis=1
  55. Mauri, Andrea, Viviana Consonni, and Roberto Todeschini. 2017. “Molecular Descriptors.” In Handbook of Computational Chemistry, 2065–93. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-27282-5_51.
  56. Mauri, Andrea, Viviana Consonni, and Roberto Todeschini. 2017. “Molecular Descriptors.” In Handbook of Computational Chemistry, 2065–93. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-27282-5_51.
  57. Lovász L, Pelikán J. On the eigenvalues of trees. Periodica Mathematica Hungarica. 1973;3: 175–182.
  58. Todeschini R, Consonni V. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. John Wiley & Sons; 2009.
  59. Calingaert G, Hladky JW. A Method of Comparison and Critical Analysis of the Physical Properties of Homologs and Isomers. The Molecular Volume of Alkanes. Journal of the American Chemical Society. 1936;58: 153–157.
  60. Kopp H. Ueber den Zusammenhang zwischen der chemischen Constitution und einigen physikalischen Eigenschaften bei flüssigen Verbindungen. Justus Liebigs Ann Chem. 1844;50: 71–144.
  61. Wiener H. Structural determination of paraffin boiling points. J Am Chem Soc. 1947;69: 17–20.
  62. Hosoya H, Hosoi K, Gutman I. A topological index for the totalπ-electron energy. Theor Chim Acta. 1975;38: 37–47.
  63. Hosoya H, Trinajstic N. Mathematics and Computational concepts in Chemistry. Horwood Chichester. 1986; 110.
  64. Hartmann H. Eine neue quantenmechanische Behandlung von CH4 und NH4+. Z Für Naturforschung A. 1947;2: 489–492.
  65. Randic M. Characterization of molecular branching. J Am Chem Soc. 1975;97: 6609–6615.
  66. Trinajstic N. Chemical graph theory. Routledge; 2018
  67. Balaban AT. Highly discriminating distance-based topological index. Chem Phys Lett. 1982;89: 399–404. doi:10.1016/0009-2614(82)80009-2
  68. Todeschini R, Gramatica P. New 3D Molecular Descriptors: The WHIM theory and QSAR Applications. In: Kubinyi H, Folkers G, Martin YC, editors. 3D QSAR in Drug Design: Ligand-Protein Interactions and Molecular Similarity. Dordrecht: Springer Netherlands; 1998. pp. 355–380. doi:10.1007/0-306-46857-3_19
  69. Consonni V, Todeschini R, Pavan M. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors. Journal of Chemical Information and Computer Sciences. 2002;42: 682–692. doi:10.1021/ci015504a