Development of quantitative structure-activity relationship models for vapor pressure estimation using computed molecular descriptors

Vapor pressure is an important property which is an indicator of chemical volatility, along with transport, partitioning, fate and distribution of environmental pollutants. Various models have been developed for the prediction of vapor pressure of chemicals using physicochemical and calculated structural properties. We have used different classes of graph theoretic indices, e.g., topostructural indices, topochemical indices, geometrical (3D) indices and, quantum chemical descriptors, for the development of predictive models for vapor based on a structurally diverse set of 469 chemicals. Initially, a set of 379 molecular descriptors was calculated using the software POLLY , Triplet , Sybyl , MOPAC , and Molconn-Z . Comparatively, three linear regression methodologies were used to develop hierarchical QSAR (HiQSAR) models, namely ridge regression (RR), principal components regression (PCR), and partial least squares (PLS) regression. The results indicate that, in general, RR outperforms PCR and PLS, and that the easily calculated topological descriptors are sufficient for the prediction of vapor pressure based on this large, diverse set of chemicals.


Introduction
The assessment of fate and distribution of environmental pollutants in various phases including air, water, and soil is important for the risk assessment of chemicals. 1The partitioning of chemicals among different phases is usually assessed using a critical list of physical properties including vapor pressure (VP), aqueous solubility, air: water partition coefficient, and octanol: water partition coefficient.
Pollutants with high vapor pressure tend to concentrate more in the vapor phase as compared to soil or water.Therefore, VP is a key physicochemical property essential to the assessment of chemical distribution in the environment.This property is also used in the design of various chemical engineering processes. 2Additionally, VP can be used for the estimation of other important physicochemical properties.For example, one can calculate Henry's law constant, soil sorption coefficient, and partition coefficient from VP and aqueous solubility.
Therefore, it is not surprising that various authors have attempted to model this important physicochemical property using quantitative structure-property relationships (QSPRs) based on calculated molecular descriptors.Katritzky et al used descriptors calculated by CODESSA in the formulation of QSPRs for a diverse set of 411 chemicals. 1Engelhardt et al used topological descriptors and computational neural networks (CNNs) in the formulation of QSPRs for the estimation of VP for a diverse set of 420 organic compounds. 3Liang and Gallagher, 4 along with Staikova et al, 5 used quantum chemically derived indices, polarizability in particular, in the development of QSPRs for vapor pressure estimation.
][8][9][10][11][12][13][14][15][16][17][18] The objective of this HiQSAR/ HiQSPR research has been twofold: description and prediction.The HiQSPR formalism uses progressively more complex indices in the development of models.The type of parameters important for the estimation of a property at each level, e.g., topological, geometrical, and quantum chemical, can be determined and used in order to understand the molecular and submolecular basis of the property (description), and good quality models based on algorithmically derived descriptors can be used for the estimation of the property of interest for any chemical, real or hypothetical (prediction).
Basak et al have used the HiQSPR approach previously in the development of VP prediction models. 12,15However, the current study utilizes an expanded set of descriptors along with three statistical modeling approaches, namely ridge regression (RR), principal components regression (PCR), and partial least squares (PLS) regression, which are appropriate for data sets wherein the number descriptors is large with respect to the number of chemical compounds and when the molecular descriptors are highly intercorrelated.

Experimental Data
The set of 469 chemicals used in this study was obtained from the Assessment Tools for the Evaluation of Risk (ASTER) database 19 and represents a subset of the Toxic Substances Control Act (TSCA) Inventory 20 for which vapor pressure (p vap ) was measured at 25 °C with a pressure range of approximately 3 -10 000 mm Hg.The molecular weights of the compounds in this data set range from 40 to 338, and the chemical diversity is described in Table 1.

Structural descriptors and hierarchical QSAR
In general, a wide variety of molecular descriptors based on chemical structure have been formulated and used in QSAR and QSPR studies. 21,22n the present study, the majority of the topological descriptors were calculated using software including POLLY v. 2.3 23 and Triplet 24 .
The topological descriptors obtained from these programs include Wiener number 25 molecular connectivity indices developed by Randić 26 and Kier and Hall, 27 frequency of path lengths of varying size, 27 information theoretic indices defined on distance matrices of graphs using the methods of Bonchev and Trinajstić, 28 Roy et al., 29 Basak et al., 30,31 as well as those of Raychaudhury et al., 32 parameters defined on the neighborhood complexity of vertices in hydrogen-filled molecular graphs, 10,11,12 and Balaban's J indices [33][34][35] as well as the triplet indices. 24The triplets result from a matrix, main diagonal column vector, and free term column vector which are converted into a system of linear equations.The notation used to represent the vectors and matrices is as follows: A = Adjacency matrix V = Vertex degree S = Distance sum N = Total number of vertices in the graph Z = Atomic number D = Distance matrix 1 = Unity matrix.After the system of N linear equations is solved, the local vertex invariants, x i , are assembled into a triplet descriptor based on one of the following operations: 1. Summation, ∑ i x i 2. Summation of squares, ∑ i x i 2 3. Summation of square roots, 4. Sum of inverse square root of cross-product over edges ij, ∑ ij (x i x j ) -1/2 5. Product, N(Π i x i ) 1/N Additional topological descriptors, including an extended set of molecular connectivity indices, electrotopological state descriptors, 36,37 general polarity descriptors, and hydrogen bonding descriptors, were calculated by Molconn-Z v. 3.50. 38An additional hydrogen bonding parameter was obtained from software developed by Basak et al. 39 In addition, ten geometrical descriptors were calculated, including six kappa shape indices which were also obtained by Molconn-Z v. 3.50.Van der Waals volume, V W , was calculated using Sybyl v. 6.2. 40In addition, two variants of the 3-D Wiener number, 3D W and 3D W H , based on the hydrogen-suppressed and hydrogen-filled geometric distance matrices, respectively, were also calculated by Sybyl v 6.2 using a SPL (Sybyl Programming Language) program developed by our group.
The six quantum chemical descriptors included in the study; namely, E HOMO , E HOMO-1 , E LUMO , E LUMO+1 , ∆Hf, and µ, were calculated for the AM1 semi-empirical Hamiltonian using MOPAC v. 6.0 41 in the Sybl interface. 40 complete list of the 379 parameters calculated for use in the current study, including brief descriptions, is provided in Table 2.Note that the topological descriptors are partitioned into two classes: topostructural, which are based solely on the connectivity of atoms within a molecule, and topochemical, which encode chemical as well as topological information.From the initial set of descriptors listed in Table 2, the following descriptors were removed and not used in the subsequent analyses: 1) Any descriptor with a constant value for all of the 469 chemicals in the data set, 2) one descriptor of each perfectly correlated pair (i.e., r = 1.0), as determined by the CORR procedure of the SAS statistical package, 42 and any descriptors with undefined values.A total of 268 descriptors were retained and subsequently used in model development.Triplet index from adjacency matrix, distance sum, and vertex degree; operation y = 1-5 DSV y Triplet index from distance matrix, distance sum, and vertex degree; operation y = 1-5 ANV y Triplet index from adjacency matrix, graph order, and vertex degree; operation y = 1-5

Topochemical (TC)
O Order of neighborhood when IC r reaches its maximum value for the hydrogen-filled graph O orb Order of neighborhood when IC r reaches its maximum value for the hydrogensuppressed graph I orb Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices IC r Mean information content or complexity of a graph based on the r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph SIC r Structural information content for r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph CIC r Complementary information content for r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph Balaban's J index based on bond types J X Balaban's J index based on relative electronegativities J Y Balaban's J index based on relative covalent radii HB 1  Hydrogen bonding parameter AZV y Triplet index from adjacency matrix, atomic number, and vertex degree; operation y = 1-5 AZS y Triplet index from adjacency matrix, atomic number, and distance sum; operation y = 1-5 ASZ y Triplet index from adjacency matrix, distance sum, and atomic number; operation y = 1-5 AZN y Triplet index from adjacency matrix, atomic number, and graph order; operation y = 1-5 ANZ y Triplet index from adjacency matrix, graph order, and atomic number; operation y = 1-5 DSZ y Triplet index from distance matrix, distance sum, and atomic number; operation y = 1-5 DN 2    We have used the hierarchical QSAR (HiQSAR) approach to model development in which increasingly more complex and computer-resource intensive classes of structural descriptors are used in a graduated manner, first utilizing the topostructural (TS) descriptors alone, followed by the addition of the topochemical (TC) descriptors, the 3-dimensional (3D) descriptors, and finally the quantum chemical (QC) descriptors.The predictive ability of the resulting models, based on the cross-validated R 2 values, are compared in order to determine whether or not the more complex descriptors are necessary in order to predict the property or activity of interest, or if the easily calculable descriptors are sufficient.For comparative purposes, predictive models based on each descriptor class, alone, were also developed.

Statistical methodology
Prior to analysis, all calculated descriptors were transformed by: ln (x + c), where x represents the original descriptor value and c is a constant added to avoid possible arithmetic error.In most cases, c = 1, as the original descriptor values are generally greater than -1.A small number of descriptors, however, have minimum values less than or equal to -1, in which case the constant added was the smallest natural number that would provide a positive sum in the equation above.
For comparative purposes, three regression methodologies were used for the development of predictive models, namely ridge regression (RR), 43 principal components regression (PCR), 44 and partial least squares (PLS). 45Each of these methodologies makes use of all available descriptors, as opposed to subset regression, and is useful when the number of descriptors is large with respect to the number of compounds and when the descriptors are highly intercorrelated.Formal comparisons have consistently shown that using a subset of available descriptors is less effective than using alternative regression methods that retain all available descriptors, such as RR, PCR, and PLS. 45,46RR is similar to PCR in that the independent variables are transformed to their principal components (PCs).However, while PCR utilizes only a subset of the PCs, RR retains them all but downweights them based on their eigenvalues. 43ith PLS, a subset of the PCs is also used, but the PCs are selected by considering both the independent and dependent variables.For each model developed, the cross-validated R 2 was obtained using the leave-one-out approach and can be calculated as follows (eq.1): where PRESS is the prediction sum of squares and SSTotal is the total sum of squares.For the sake of brevity, the highly parameterized models are not included in this paper, rather the crossvalidated R 2 and PRESS statistic are reported for each model.Another useful statistical metric is the t value associated with each model descriptor, defined as the descriptor coefficient divided by its standard error.Descriptors with large | t | values are important in the predictive model and, as such, can be examined in order to gain some understanding of the nature of the property or activity of interest.
It should be strongly stated that ordinary least squares (OLS) regression is inappropriate when the number of descriptors is large with respect to the number of chemical compounds in the data set, and that the conventional R 2 metric is without value in this situation.Unlike R 2 , which tends to increase upon the addition of any descriptor, the cross-validated R 2 tends to decrease upon the addition of irrelevant descriptors and is a reliable measure of model predictability. 47Unlike ordinary least squares regression, the number of descriptors is not an issue with the regression methodologies used in the present study.The number of descriptors included in the regression models developed in this study is as follows: TS (99), TS+TC (252), TS+TC+3D (262), TS+TC+3D+QC (268), TC (99), 3D (10), QC (6).RR, PCR, and PLS are appropriate methodologies when the number of descriptors exceeds the number of observations, and they are designed to utilize all available descriptors, as opposed to subset regression, in order to produce an unbiased model whose predictability is accurately reflected by the R 2 cv , regardless of the number of independent variables in the model.The distinction between these methods and OLS regression is important and cannot be overemphasized.

Results and Discussion
The major objective of this paper was the estimation of vapor pressure of chemicals using molecular descriptors that can be computed directly from molecular structure without the input of any other experimental data.To this end, we used topostructural, topochemical, geometrical, and quantum chemical descriptors in the formulation of HiQSPR models for log 10 (p vap ).All models developed in this study are based the complete set of 469 structurally diverse chemicals.
Results in Table 3 indicate that the combination of TS and TC descriptors resulted in a highly predictive RR model (R 2 c.v = 0.895) .; the addition of three dimensional and quantum chemical indices to the set of independent variables did not result in significant improvement in model quality.It may be noted that we have observed such results for various other physicochemical and biological properties including mutagenicity, 10,48 boiling point, 49 blood:air partition coefficient,50 tissue: air partition coefficient, 51 etc. 6,14,16,18 Only in limited cases, e.g., halocarbon toxicity, 9 the addition of quantum chemical indices after TS and TC parameters resulted in significant improvement in QSAR model quality.
It is interesting to note that of the three linear regression methods used, viz.RR, PCS, and PLS, ridge regression outperformed the other two methods significantly.This is in line with our earlier observations with HiQSARs using the three methods.It is instructive to look at the top ten molecular descriptors, based on t value, in the ridge regression VP model derived from TS + TC indices (Table 4).They can be looked upon as representing the following features: a) size (totop, DN 2 Z 1 ), hydrogen bonding (HB 1 , SHHBa), c) polarity (Qv), d) heterogeneity of atom types (IC 0 ), and e) presence of various types of hetero atoms and functional groups (SssO, SsF, SsNH 2 , SaaO).
In the LSER approach, a combination of molecular size, hydrogen bonding, and polarity are used to estimate partitioning behavior of chemicals. 53,54The presence of specific hetero atoms, functional groups and different atom types, as encoded by information theoretic, triplet, and electrotopological indices, will probably be related to dipole-dipole interactions among the molecules and also specific regional interactions such as hydrogen bonding.Such factors have been found to be useful in predicting VP by Liang and Gallagher, 4 Katritzky et al, 1 Engelhardt et al, 3   In conclusion, the VP prediction model for a diverse set of organic chemicals derived from easily calculated molecular descriptors gave very good results.Such a model could be useful in the estimation of vapor pressure of chemicals that fall within the structural types considered in this paper.
h χ b Bond path connectivity index of order h = 0-6 h χ b C Bond cluster connectivity index of order h = 3-6 h χ b Ch Bond chain connectivity index of order h = 3-6 h χ b PC Bond path-cluster connectivity index of order h = 4-6 h χ v Valence path connectivity index of order h = 0-10 h χ v C Valence cluster connectivity index of order h = 3-6 h χ v Ch Valence chain connectivity index of order h = 3-10 h χ v PC Valence path-cluster connectivity index of order h = 4-6 J B and Staikova et al.5

Table 1 .
Chemical class composition of the vapor pressure data set

Table 2 .
Symbols, definitions and classification of calculated molecular descriptors y Triplet index from adjacency matrix, graph order, and distance sum; operation y = 1-5 AN1 yTriplet index from adjacency matrix, graph order, and number 1; operation y = 1-5

Table 2 .
Continued ANN yTriplet index from adjacency matrix, graph order, and graph order again; operation y = 1-5 ASV y

Table 4 .
Important topological descriptors for the prediction of vapor pressure, based on t value, from the TS+TC ridge regression model