Prediction of partitioning properties for environmental pollutants using mathematical structural descriptors

Predictive models, based solely on molecular structure, were developed for three environmentally related partitioning properties: water solubility, soil/sediment partition coefficient, and octanol/water partition coefficient. Data for a diverse set of 136 chemicals were taken from the literature and include aromatic and aliphatic compounds, as well as herbicides, pesticides, and polycyclic aromatic hydrocarbons. The hierarchical QSAR (HiQSAR) approach to model building was employed, in which increasingly computer-resource-intensive classes of structural descriptors are used only when the simpler and more easily calculable descriptors do not provide adequate models. The results indicate that the simple topostructural (TS) and topochemical (TC) descriptors provide the best models, and that, in many cases, these structure-based models are superior to those based on properties.


Introduction
Modern lifestyle in the industrialized world depends on the use of thousands of chemicals for various industrial processes as well as for special purposes such as drugs, pesticides, and herbicides. In the United States, the Toxic Substances Control Act (TSCA) Inventory currently has over 75,000 chemicals, of which over 2,800 are high production volume chemicals (HPVs). 1 These chemicals may be released to the environment during their production, transport, and intended uses. Pollutants can also be released into the environment from underground storage tanks, hazardous waste disposal sites, municipal landfills, and accidental spills. The Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA) priority list contains 275 chemicals, many of which are found at facilities on the National Priority List (NPL). 2 These chemicals pose a substantial threat to human, wildlife, and ecological health.
Understanding the distribution of pollutants among different environmental phases is crucial to their hazard assessment and to the remediation of contaminated sites. Many contaminants end up in the soil and sediment. Remediation often involves extraction of the polluting chemicals into the aqueous phase, followed by treatment by physical, chemical, or biological processes. The extraction methodology is critically dependent on the partitioning properties of the chemicals. 3 Physicochemical properties such as octanol/water partition coefficients (K ow ) and aqueous solubility (S) have been used in the estimation of partitioning of chemicals among various environmental phases. 3,4 Reversed-phase high performance liquid chromatography has also been used in the prediction of S and K ow . Whereas such property-property correlations, designed to estimate properties of environmental interest from other known physicochemical properties, work reasonably well, this approach is limited by the unavailability of the latter properties for the majority of chemicals of environmental concern. 5
Various studies have shown that properties such as S 6 and K ow 7,8 can be predicted using mathematical molecular descriptors, which can be calculated directly from chemical structure without the input of any other experimental data. Such molecular descriptors quantify aspects of structure that contribute to interactions of chemicals with hydrophobic and hydrophilic phases. 10,11 Predictive models based on calculated descriptors can provide cost-effective and rapid estimates of the partitioning behavior of environmental pollutants. They can also provide insight into the environmental behavior of chemicals not yet synthesized or those that cannot be examined experimentally due to their extremely hazardous nature. Chu and Chan used K ow to predict S and soil/sediment partition coefficients (K oc ) of a diverse collection of pollutants, viz., aliphatics, aromatics, pesticides, herbicides, and polycyclic aromatic hydrocarbons (PAHs). 3 They also developed predictive models for K oc based on solubility. As stated earlier, such property-property correlation methods are of limited applicability. We therefore investigated whether properties of environmental interest can be estimated from molecular structural descriptors. We have formulated a hierarchical quantitative structure-activity relationship (HiQSAR) approach in which calculated descriptors are used in a graduated manner, such that computationally more resource-intensive parameters are used only when easily calculable indices do not provide acceptable results. We have carried out a comparative study of physicochemical properties vis-à-vis the theoretically based HiQSAR approach in the estimation of partitioning of chemicals of environmental concern.

Experimental data
Chu and Chan 3 collected data for water solubility (S), octanol/water partition coefficient (K ow ), and soil/sediment partition coefficient (K oc ) from several sources, including an EPA report on ground water remediation and the Handbook of Environmental Data on Organic Chemicals. 12,13 While Chu and Chan selected 148 compounds, we omitted 12 from their collection, resulting in a total of 136 chemicals in our data set. The omitted compounds include: a) isomers that are indistinguishable with respect to our software (1,2-dichloroethene, hexachlorocyclohexane), b) compounds with fewer than three non-hydrogen atoms, for which our complete set of descriptors cannot be calculated (chloromethane, iodomethane), c) mixtures (chlorotoluene, cresol, xylene), and d) compounds that contain atoms not represented in our software (cacodylic acid). Based on Chu and Chan's classification scheme, the 136 chemicals were partitioned into five categories: aliphatics (26), aromatics (43), herbicides (18), polycyclic aromatic hydrocarbons (19), and pesticides (30). The data are provided in Table 1.

Structural descriptors
Several software programs, including POLLY v. 2.3, 14 Triplet, 15,16 and Molconn-Z v. 3.5, 17 were used to calculate a set of descriptors based solely on molecular structure. The descriptors numerically represent various aspects of the chemical structure and can be classified into one of three categories based on level of complexity: topostructural (TS), topochemical (TC), and 3-dimensional/geometrical (3D). The TS descriptors are the simplest in that no chemical information is encoded, with molecular structure viewed only in terms of atom connectivity. The TC descriptors, in addition to encoding information about how the atoms are connected within the molecule, also take chemical information into account, including atom type and bond type. The most complex of the three descriptor classes is the 3D class, which encodes information on the 3-dimensional aspects of molecular structure. TS, TC, and 3D descriptors were used in a hierarchical manner in order to identify any model improvement upon addition of increasingly complex descriptor classes. For comparative purposes, single-class models were also developed. Table 2 contains a complete list of the calculated TS, TC, and 3D descriptors. From this set, the following descriptors were removed and not used in the subsequent analyses: 1) any descriptor with a constant value for all, or almost all, of the 136 chemicals in the data set, 2) one descriptor of each perfectly correlated pair (i.e., r = 1.0), as determined by the CORR procedure of the SAS statistical package, 18 and 3) any descriptors with undefined values. A total of 260 descriptors were available for modeling.

Table 2. Symbols, definitions and classification of calculated molecular descriptors
[...]: Triplet index from adjacency matrix, graph order, and graph order again; operation y = 1-5
ASV y: Triplet index from adjacency matrix, distance sum, and vertex degree; operation y = 1-5
DSV y: Triplet index from distance matrix, distance sum, and vertex degree; operation y = 1-5
ANV y: Triplet index from adjacency matrix, graph order, and vertex degree; operation y = 1-5

Topochemical (TC)
O: Order of neighborhood when IC r reaches its maximum value for the hydrogen-filled graph
O orb: Order of neighborhood when IC r reaches its maximum value for the hydrogen-suppressed graph
I orb: Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
IC r: Mean information content or complexity of a graph based on the r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
SIC r: Structural information content for r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
CIC r: Complementary information content for r th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
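The three descriptor-screening rules given in the Structural descriptors section can be sketched in Python as follows. This is only an illustration, not the SAS CORR workflow used in the study: the function name `screen_descriptors`, the `max_same` cutoff standing in for "almost all", and the numerical tolerance `tol` are all assumptions introduced here.

```python
import numpy as np

def screen_descriptors(X, names, max_same=0.95, tol=1e-10):
    """Return (X_kept, names_kept) after three screening steps.

    X        : (n_compounds, n_descriptors) array of descriptor values
    names    : list of descriptor names, one per column
    max_same : assumed cutoff for "constant for almost all compounds"
    tol      : assumed tolerance when testing for r = 1.0
    """
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Rule 3: drop descriptors with undefined (NaN or infinite) values.
        if not np.all(np.isfinite(col)):
            continue
        # Rule 1: drop descriptors whose most frequent value covers
        # all, or almost all, of the compounds.
        _, counts = np.unique(col, return_counts=True)
        if counts.max() / col.size >= max_same:
            continue
        # Rule 2: drop a descriptor that is perfectly correlated (r = 1.0)
        # with a descriptor that has already been kept.
        if any(np.corrcoef(col, X[:, k])[0, 1] >= 1.0 - tol for k in keep):
            continue
        keep.append(j)
    return X[:, keep], [names[j] for j in keep]
```

Keeping the first member of each perfectly correlated pair is an arbitrary choice in this sketch; the text does not say which member was retained.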

Statistical methodology
Each of the descriptors was transformed by the natural logarithm prior to model development, as their scales differed by several orders of magnitude. To avoid taking the logarithm of a non-positive value, a constant was added to each descriptor before log transformation. For descriptors with minimum values less than -1, the constant added was the smallest natural number that would provide a positive sum. For descriptors with minimum values greater than -1, the constant 1 was used. The dependent variables, i.e., S, K oc , and K ow , were also scaled by the natural logarithm. (The log-scaled descriptors are available as supplemental material.) For comparative purposes, results are reported based on two regression methodologies for the development of predictive models for each endpoint, namely ridge regression (RR) 19 and partial least squares (PLS). 20 Both methodologies make use of all available descriptors, as opposed to subset regression, and are useful when the number of descriptors exceeds the number of compounds in the data set (i.e., rank deficient data) and when the descriptors are highly intercorrelated. Formal comparisons have consistently shown that using a subset of available descriptors is less effective than using alternative regression methods that retain all available descriptors, such as RR and PLS, and deal with rank deficiency in another way. 21,22
ISSN 1424-6376 Page 70 © ARKAT USA, Inc
With ridge regression, the descriptors are first transformed to their principal components (PCs). All PCs are retained but are "shrunk" differentially according to their eigenvalues. 19 For each model developed, the cross-validated R 2 was obtained using the leave-one-out approach and can be calculated as follows (eq. 1):
$$R^2_{\mathrm{cv}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}_{\mathrm{Total}}} \qquad (1)$$
where PRESS is the prediction sum of squares and SSTotal is the total sum of squares.
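The shift-and-log scaling of the descriptors described above can be illustrated with a short Python sketch. The function names `shift_constant` and `log_scale` are hypothetical. The text does not say what happens when a descriptor's minimum is exactly -1; this sketch assumes that case follows the "smallest natural number giving a positive sum" rule, since adding 1 would give log(0).

```python
import math
import numpy as np

def shift_constant(col_min):
    """Constant added to a descriptor before taking the natural log."""
    if col_min > -1:
        # Minimum greater than -1: the constant 1 is used.
        return 1
    # Minimum of -1 or below (the -1 case is an assumption): smallest
    # natural number c such that col_min + c is strictly positive.
    return math.floor(-col_min) + 1

def log_scale(X):
    """Apply ln(x + c) column by column; returns (scaled matrix, constants)."""
    X = np.asarray(X, dtype=float)
    consts = np.array([shift_constant(m) for m in X.min(axis=0)])
    return np.log(X + consts), consts

# Illustrative values only, not data from the study.
X_demo = np.array([[0.0, -3.2, -0.5],
                   [10.0, 5.0, 2.0]])
X_log, consts = log_scale(X_demo)
```

For a descriptor with minimum -3.2 the constant is 4, because 3 would still leave a negative sum, while a descriptor with minimum -0.5 or 0 simply receives the constant 1.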
It should be emphasized that ordinary least squares (OLS) regression is inappropriate for use with rank deficient data, and that the conventional R 2 metric is without value in this situation. Unlike R 2 , which tends to increase upon the addition of any descriptor, the cross-validated R 2 tends to decrease upon the addition of irrelevant descriptors and is a reliable measure of model predictability. 23 RR and PLS models based on structural descriptors were developed for each of the five chemical subsets as well as for the combined set of 136 compounds. For comparative purposes, we also developed property-based models for the prediction of: a) S, based on K ow , b) K oc , based on K ow , and c) K oc , based on S. The SAS statistical package 18 was used to develop these ordinary least squares models; OLS is appropriate here given the small number of independent variables relative to the number of observations.
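A minimal NumPy sketch of ridge regression in principal-component form, together with the leave-one-out cross-validated R 2 computed as 1 - PRESS/SSTotal, is given below. It illustrates the general technique only. The function names, the fixed ridge parameter `lam`, and the choice to center but not standardize the descriptors are assumptions; the procedure the authors used to select the ridge parameter is not stated in the text.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam):
    """Ridge regression via the SVD (principal components) of centered X."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    U, s, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    # Every PC is retained; the PC with singular value s is shrunk by the
    # factor s^2 / (s^2 + lam), so low-variance PCs are shrunk the most.
    beta = Vt.T @ ((s / (s ** 2 + lam)) * (U.T @ (y_train - y_mean)))
    return y_mean + (X_test - x_mean) @ beta

def loo_cv_r2(X, y, lam):
    """Leave-one-out cross-validated R^2 = 1 - PRESS / SSTotal."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i            # hold out compound i
        pred = ridge_fit_predict(X[mask], y[mask], X[i:i + 1], lam)[0]
        press += (y[i] - pred) ** 2         # squared prediction error
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

# Synthetic illustration only: 10 "compounds", 40 "descriptors" (rank deficient).
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(10, 40))
y_wide = rng.normal(size=10)
r2_cv_wide = loo_cv_r2(X_wide, y_wide, lam=1.0)
```

Because each held-out compound is predicted by a model that never saw it, this statistic is not driven upward merely by adding descriptors, which is the property the text relies on.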

Results and Discussion
The major objective of the study reported in this paper is to compare the relative effectiveness of physicochemical vis-à-vis calculated structural descriptors in the estimation of partitioning properties of chemicals of environmental concern.
For the sake of brevity, the many highly parameterized models are not reported. However, Tables 3-7 provide the associated cross-validated R 2 values for the five chemical subsets, while the cross-validated R 2 values for the combined data are found in Table 8. In all cases, there is no significant improvement in model quality when the more complex 3D descriptors are added to the topological (i.e., TS and TC) descriptors.
When examining the regression results, it is important to keep in mind that while R 2 is necessarily a nonnegative number, this is not true of the cross-validated R 2 , which can take on negative values if the model is extremely poor (see eq. 1).
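A short worked case shows how a negative value arises. It assumes the usual leave-one-out definition, cross-validated R 2 = 1 - PRESS/SSTotal, and considers a hypothetical model with no predictive content that simply predicts each held-out compound by the mean of the remaining n - 1 observations (the symbols below are introduced here for illustration):

```latex
\begin{align*}
R^2_{\mathrm{cv}} &= 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}_{\mathrm{Total}}},
\qquad \mathrm{PRESS} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_{(i)}\bigr)^2,
\qquad \mathrm{SS}_{\mathrm{Total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\hat{y}_{(i)} &= \frac{n\bar{y} - y_i}{n-1}
\;\Rightarrow\; y_i - \hat{y}_{(i)} = \frac{n}{n-1}\,(y_i - \bar{y}) \\
\mathrm{PRESS} &= \Bigl(\frac{n}{n-1}\Bigr)^{2} \mathrm{SS}_{\mathrm{Total}}
\;\Rightarrow\; R^2_{\mathrm{cv}} = 1 - \Bigl(\frac{n}{n-1}\Bigr)^{2} < 0
\end{align*}
```

For n = 136 this gives a cross-validated R 2 of about -0.015, so any model that predicts held-out compounds worse than this mean-only baseline yields a more strongly negative value.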
With respect to water solubility, there is improvement in model quality upon the addition of TC descriptors to the TS indices (especially pronounced with the aromatics, herbicides, and pesticides), except in the case of the aliphatic subset, for which the TS descriptors alone provide the best water solubility model. The best solubility model for the total set of 136 compounds is the TS+TC model, with a cross-validated R 2 value of 0.739 (Table 8). High-quality models obtained for the various subsets include the TC model for the aromatics, with a cross-validated R 2 value of 0.905 (Table 4), and the TS+TC model for the polycyclic aromatic hydrocarbons, with a cross-validated R 2 value of 0.808 (Table 6). [...] values of 0.927 and 0.944, respectively. The best K ow model for the total set of 136 compounds was the TC model, with a cross-validated R 2 value of 0.570 (Table 8).
The statistical analyses of the complete set of 136 compounds revealed a number of compounds with high influence upon the models. These were considered independently for each of the three endpoints, and additional models for the combined set of chemicals were developed omitting these compounds as outliers (Table 9). For the S model, paraquat, cyclophosphamide, and dechlorane were omitted. These same compounds were omitted from the K ow model, in addition to trifluralin, kepone, and trichlorfon. From the K oc model, diethylstilbestrol, trifluralin, 1,2:7,8-dibenzopyrene, cyclophosphamide, kepone, dechlorane, and trichlorfon were omitted. With the removal of these outliers, model improvement was observed. For example, with respect to the TC models, the cross-validated R 2 values improved from 0.735 to 0.846 for the S model, from 0.720 to 0.790 for the K oc model, and from 0.570 to 0.865 for the K ow model. The results of the comparative property-based models are summarized in Table 10. The structure-based models were superior to the property-based models for the herbicides, pesticides, aromatics, and the combined set of compounds. In addition, it is the TS and TC descriptors that provide the best structural models for these chemical subsets. It should be noted that the TS descriptors alone provide the best solubility model for the aliphatic subset. The property-based models are superior to the structure-based models for the aliphatics and the polycyclic aromatic hydrocarbons. With respect to the K oc models, the herbicides, aromatics, and the combined set of chemicals are better modeled with the structure-based descriptors, while the pesticides, aliphatics, and polycyclic aromatic hydrocarbons are better modeled with K ow and solubility. It is of interest to note that the 3D descriptors provide the best structure-based models for the aliphatics and the aromatics with respect to the prediction of K oc .

[...] the magnitudes of distances between all possible pairs of vertices of a graph
I W D: Mean information index for the magnitude of distance
W: Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
I [...] the distance matrix partitioned by frequency of occurrences of distance h
M 1: Zagreb group parameter = sum of square of degree over all vertices
M 2: Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
h χ: Path connectivity index of order h = 0-10
h χ C: Cluster connectivity index of order h = 3-6
h χ PC: Path-cluster connectivity index of order h = 4-6
h χ Ch: Chain connectivity index of order h = 3-10
P h: Number of paths of length h = 0-10
h χ b: Bond path connectivity index of order h = 0-6
h χ b C: Bond cluster connectivity index of order h = 3-6
h χ b Ch: Bond chain connectivity index of order h = 3-6
h χ b PC: Bond path-cluster connectivity index of order h = 4-6
h χ v: Valence path connectivity index of order h = 0-10
h χ v C: Valence cluster connectivity index of order h = 3-6
h χ v Ch: Valence chain connectivity index of order h = 3-10

Table 2. Continued
SHavin: E-State of C atoms in the vinyl group, =CH-, bonded to an aromatic C
SHarom: E-State of C sp 2 which are part of an aromatic system
SHHBd: Hydrogen bond donor index, sum of Hydrogen E-State values for -OH, =NH, -NH 2 , -NH-, -SH, and #CH
SHwHBd: Weak hydrogen bond donor index, sum of C-H Hydrogen E-State values for hydrogen atoms on a C to which a F and/or Cl are also bonded

Table 3. Regression results for the aliphatic subset (N = 26)

Table 4. Regression results for the aromatic subset (N = 43)

Table 5. Regression results for the herbicide subset (N = 18)

Table 9. Regression results for combined data sets, with outliers removed with respect to each endpoint