Correlation of liquid viscosity with molecular structure for organic compounds using different variable selection methods

Improved models for predicting viscosities at 20 ° C were generated using three different methods for descriptor selection. Data set of 361 diverse organic molecules and their experimental viscosities were used for developing the models. Molecular properties are encoded by 822 initial descriptors computed by the CODESSA program. CODESSA, GFA and CROMRsel methods are capable of selecting good and facile viscosity models having only five descriptors. These methods are automated procedures for generation of simple multiregression (MR) models. All three methods produce excellent linear models, but the models obtained by the CROMRsel method are somewhat better. In addition, using the CROMRsel suite of programs a very good nonlinear MR model having five descriptors (two linear and three cross-product descriptors, R 2 = 0.908, S = 0.175) was obtained. Nonlinear models generated in this study show that the classical MR based methods can be efficiently used to obtain simple and very good nonlinear MR models. The best five-descriptor models selected in this study usually contain one geometrical (gravitational index) and one topological descriptor (Randi ć index of order 0), and three electrostatic descriptors which reflect the bonding properties of molecules, i.e. their capabilities to create (mainly) hydrogen bonds. Because of that, hydrogen-donors and hydrogen-acceptors surface areas, charges, total molecular surface areas, and maximum net atomic charges and state energies for oxygen atoms appear to be key factors for modeling the viscosity of organic molecules.


Introduction
Developing models for predicting different not easily measurable experimental properties of molecules is a growing field of research.Progress in this field is accelerated by an enormous increase of computer power, accompanied by development of a number of program packages for graphical and computational manipulation with molecules, as well as for calculation of a huge number of descriptors purely from chemical structure of molecules.Then these easily computable descriptors are used as input information for developing models for computing/predicting values of other not-so-easy-measurable experimental properties of molecules.
Usefulness of molecular modeling methods will be illustrated by modeling the viscosity of liquids, which is one of the most important collective properties of molecules (liquids).Viscosity is one of the most significant transport properties for many chemical engineering problems (e.g.−4 The data set of molecules is the same as in our previous paper. 3Molecular descriptors are the same as in ref. 4, and were computed by the CODESSA program. 5In the model development stage three variable selection methods will be used: (1) heuristic method from version 2.21 of the improved CODESSA program; 5,6 (2) Genetic Function Approximation (GFA) method 7 as implemented in Cerius2 program package, 8 and (3) CROMRsel based approach. 9,10A short description of these methods is given in Experimental section.Here we will compare these three methods for the model generation used in the field of QSPR (quantitative structure-property relationship) in order to rank them according to their usefulness in developing: (1) good (stable and predictable) QSPR models, and (2) facile/straightforward (easily interpretable) models.Some of the descriptor selection methods used in modeling process (like Neural Network Ensemble (NNE)) 11 produce models that are not simple and that are difficult to interpret.We have shown in our recent study that such models can be replaced by simpler multiregression (MR) models, 12 and due to that fact we concentrate here on multiregression based methods.Nonlinear viscosity models will be generated only using the CROMRsel approach.Finally, methods used will be strictly compared (under the same conditions) and models obtained will be critically analyzed from a viewpoint of their goodness and simplicity.

Results and Discussion
In this study, experimental viscosities of 361 structurally diverse organic compounds containing C, H, O, N, S, were used.Each of the three methods (CODESSA, GFA and CROMRsel) was applied to this set of data.Up to 822 molecular descriptors were calculated using the CODESSA program encoding properties of each of the 361 molecules, 4−6 i.e. the same set of descriptors as used in ref. 4. In the CODESSA program, descriptors were computed in a totally automated way solely from the chemical structures of compounds.CODESSA can generate a large number of quantitative descriptors that encode constitutional, topological, geometrical, electrostatic, and quantum chemical characteristics of a molecule.After filtering, 420 descriptors remain.containing two (-2) descriptors.b descriptors are designated according to their order in data set (i.e.d1, …, d420) -the meaning of selected descriptors is given in Table 7. c square of the fitted (R) and the LOO CV (R cv ) correlation coefficient.d fitted (S) and LOO CV (S cv ) standard error of estimate having N = 361 in denominator.Please note that S and S cv in ref. 4 were computed using N -I -1 in denominator, where I stands for the number of descriptors involved in the model.
The best multiregression models obtained by the standard CODESSA (version 2.21) descriptor selection procedure containing two to seven descriptors were generated.Details about these models and their statistical parameters are given in Table 1.Description of heuristic selection procedure used in CODESSA for selection of models is given in Experimental Section.Quality of models is expressed by computing the correlation coefficient and standard error of estimate both in the fitting (R and S) and in the leave-one-out (LOO) cross-validation (CV) procedure(s) (R cv and S cv ).For detailed description of LOO CV procedure see Experimental section or refs.9  In Table 3 the best MR models containing one to five descriptors selected by the GFA method, which is incorporated in Cerius2 program package, are given.The standard GFA procedure was used with default initial parameters as suggested in Cerius2 package.Description of the GFA selection procedure is given in Experimental Section.Details about the best fivedescriptor viscosity model selected by the GFA method are given in Table 4.   2.
The third method used was the CROMRsel procedure for selecting the best possible subsets of descriptors. 9,10This procedure searches over all possible subsets having I (I = 1, …, 5 in this case) descriptors, and, for each I selects the best model.In this case intercorrelation between descriptors involved in the MR model is minimised.Due to this fact, one can expect that such models have the best statistical performance, i.e. fitted, cross-validated and predictive performances.The best one-to five-descriptor models selected by the CROMRsel procedure for selecting the best possible combination (subset) of descriptors are given in Table 5.    6.
Table 6 gives additional details and statistical parameters of the best five-descriptor model obtained by CROMRsel.All descriptors involved in all CODESSA, GFA and CROMRsel models in Tables 1, 3 and 5 are listed and described in Table 7.Additional explanation of descriptors can be found in refs.13 and 14.Calculated values for 361 compounds by the fit and LOO CV procedures versus experimental viscosities using model M-5 from Table 6 are given in Figure 1.Finally, nonlinear modeling using CROMRsel algorithms was performed.Starting from initial 420 descriptors 25 descriptors were preselected by using stepwise CROMRsel procedure called SSP2 (for description see Experimental section and ref .10).Then, nonlinear terms in the form of squares and cross-products of initial 25 descriptors were computed and added to the initial 25 descriptors.Data set created in such a way contained 350 descriptors and was a starting set for generating nonlinear models.Using the standard CROMRsel procedure for the selection of the best possible subset of descriptors, the best one-to five-descriptor models were obtained and details of the best selected models are given in Table 8.One can see that models containing two to five descriptors include initial (linear) descriptors d19 and d22 (except Mnonlin-2 model), as well as cross-products of initial descriptors.In this study only two-fold cross-products were used.The value of a two-fold cross-product descriptor for a molecule is just the product of initial (single) descriptor values of two descriptors calculated for the molecule.Calculated values for 361 compounds by the fit and LOO CV procedures versus experimental viscosities using model Mnonlin-5 from Table 9 are given in Figure 2.   10 show that the models selected in this study are good and stable.Linear models having the same number of descriptors selected by the CROMRsel method are somewhat better in prediction on external set than corresponding models obtained by CODESSA and GFA.Nonlinear models obtained by the CROMRsel procedures are considerably better than the linear ones.The best nonlinear five-descriptor model has the standard error of 0.158 log units on external set, and linear five-descriptor models obtained by CODESSA, GFA and CROMRsel have standard error of about 0.21 log units (Table 10).Moreover, the best nonlinear twodescriptor model has smaller standard error of estimate on external set (0.198 lo units) than the best linear five-descriptor models.

ISSN 1551-7004
Page 54 © ARKAT USA, Inc From the statistical point of view, one can conclude that all the models selected in this study are highly significant and stable.This can be best seen from the calculated standard errors obtained on external set of molecules (Table 10) and from the differences between the fitted (S) and LOO cross-validated (S cv ) standard errors of estimate.Additionally, significance of each descriptor involved in the best five-descriptor models selected in this study can be seen in Tables 2, 4, 6 and 9 through their t-test values.However, one can say that in the case of linear models the CROMRsel method generates better models.The CROMRsel procedure is simple for use because only the size of models is input, which is defined by the number of descriptors that should be involved in the model.The use of CODESSA and GFA methods is also simple, but in these two methods there is at least one parameter more (than in the case of CROMRsel) that should be given or optimised by the user (see the description of selection methods used in GFA and CODESSA 2.21 in Experimental section).In the next paragraph, an example related to the optimization of the significant intercorrelation level between descriptors that can be involved in the CODESSA models is given.On the other hand the CROMRsel procedure is more time consuming (comparing with GFA and CODESSA) in the case of models containing more descriptors.In this study the selection of four-and five-descriptor models used more computer time than the other two methods.
For this data set, models obtained by CROMRsel are the best possible one-to five-descriptor multivariate regression models.Therefore, these models can be used for finding out better input parameters that should be defined by users before running variable selection procedures in the CODESSA and GFA methods.After changing default value of the significant intercorrelation level in CODESSA 2.21 the best five-descriptor model was improved from R 2 = 0.8536 (R cv = 0.8446, F = 414.1) to R 2 = 0.8615 (R cv = 0.8536, F = 441.8).According to the statistical parameters this model is almost as good as the five-descriptor CROMRsel and GFA models.This test indicates that some input parameters in CODESSA (like an intercorrelation level) should be softened, i.e. the selection procedure should be repeated with different values of input parameters in order to improve models.Results presented here are obtained with default input parameters in order to test these three methods 'as they are originally given' without additional optimizations.In the GFA method only default initial parameters were used because it is not easy to know in advance what is the best range of input parameters (like the number of generations and the value of smoothing parameter).
Nonlinear models (having the same number of descriptors and optimized parameters as linear models) generated by the use of CROMRsel algorithms are better than linear ones.An improvement obtained by the best four-and five-descriptor nonlinear models (R cv 2 = 0.884 and R cv 2 = 0.904, respectively) with respect to the previous linear model obtained on the same set of data 4 is evident (the best four-and five-descriptor models in ref. 4 had R cv 2 = 0.852 and R cv 2 = 0.844, respectively).By comparing the type of descriptors involved in the five-descriptors models one can see that each models contains mainly two type of descriptors: (1) geometrical and/or topological descriptors, and (2) several (usually three) electrostatic descriptors which describe the potency of molecule for forming preferably hydrogen bonds.9.
From the first class of descriptors, the gravitational index is involved in each model.In addition, in all five-descriptor models this descriptor is the most significant one (i.e. it has the highest t-test value).The gravitational index reflects the effective mass distribution in the molecule and effectively describes intermolecular dispersion forces in the bulk liquid media (i.e.accounts simultaneously for both the atomic masses and for their distribution within the molecular space). 4The second descriptor from the first class is the Randić index 15 of order 0, which is related to the number of atoms in molecule.In one case (the CODESSA five-descriptor model) instead of the Randić index a descriptor related to the relative number of rings in the molecule is involved.This descriptor accounts for both the size and shape of a molecule.
From the second class of descriptors the most often involved descriptors are those related to the hydrogen-donor(s) and hydrogen-acceptor(s) charged surface areas of molecules, total molecular surface areas, and maximum net atomic charges and state energies for oxygen atoms in organic molecules.These descriptors are related to the hydrogen bonding ability of compounds and to the ability of forming other polar interactions between solute molecules.This reflects that the mass, size, shape as well as the hydrogen bonding ablities of molecules are key factors which govern their liquid viscosity.

Calculation of descriptors.
The procedure for computing molecular descriptors by the CODESSA program is described in ref. 4. By using the CODESSA program, the number of descriptors with up 822 for 361 diverse organic molecules were calculated.However, from the total number of 822 descriptors any descriptor containing more than 80% zeros or identical values was eliminated.By this filtering procedure 402 descriptors were removed, and the set of 420 descriptors remained and was used in the modeling.Zero values were assumed for missing values of each descriptor.In addition, the list of 822 descriptors (and, also, the list of 420 descriptors which were obtained after filtering) and their values for 361 molecules can also be obtained in electronic form on request.
Heuristic descriptor selection method. 4,5In the case of the heuristic method used in CODESSA, the selection of descriptors for multivariate regression models was performed based on the several criteria.The algorithm used several parameters (options), which control the performance of the method.Default values of the parameters were selected to fit the most common situations (these parameters can be easily changed in the respective dialog box).Firstly, descriptors from the "starting set" of descriptors were eliminated if: (a) the F-test's value for the one-descriptor correlation with the descriptor was below 1.0, (b) the squared correlation coefficient of the one-descriptor model was less than R min 2 , (c) the descriptor's t-value was less than t 1 (0.1),(d) the descriptor was highly intercorrelated above r full with another descriptor.The values of R min 2 , t 1 and r full were user specified.In this study the default values of these parameters were used for the models given in Table 1, i.e. 0.01, 0.1 and 0.99 for R min 2 , t 1 and r full , respectively.Then the following procedure followed: (1) Starting with the top descriptor from the preselected list of descriptors all two-descriptor models were calculated for pair of descriptors having intercorrelation coefficient lower than r sig = 0.8. ( The best 10 (n+1)-descriptor models were again submitted to the same procedure, until the multivariate regression model with a certain number of descriptors was obtained.
Genetic Function Approximation (GFA) method. 7The GFA algorithm uses a genetic algorithm to perform a search over the space of possible QSAR/QSPR models using the LOF score as a parameter for estimating the fitness of each model.Such evolution of a population of randomly constructed models leads to the discovery of highly predictive QSARs/QSPRs.Genetic algorithms were derived by analogy with the spread of mutations in a population.According to this analogy "individuals" are represented as a 1D string of bits.An initial population of individuals is created, usually with random initial bits.In the GFA method, models containing a randomly chosen proper subset of the independent variables are collected and then the collected models are "evolved".A generation is the set of models resulting from performing the multiple linear regression on each model.A selection of the best ones becomes the next generation (set of models).Crossover operations are performed on these, which take some variables from each of two models to produce an offspring.In addition, the best model from the previous generation is retained.Besides linear terms, there can also be spline, quadratic and quadratic spline terms.These are added or deleted by mutation operations.A disadvantage is that it is not possible to introduce cross-products of descriptors as nonlinear terms in the GFA method included in Cerius2.Only default values of input parameters were used in GFA (the default values for the number of generations and smoothing parameter were 5000 and 1.0, respectively).
CROMRsel procedure. 9,10For generating linear CROMRsel models the algorithm for selection of the possible subsets of descriptors was used.Detailed description of this algorithm was given in ref. 9 and the final result, which one can obtain by using this CROMRsel algorithm, is selection of the best possible models containing subsets of I descriptors (I = 1, …, 5 in this study).
SSP2 CROMRsel procedure. 9,10For generating nonlinear CROMRsel models, the preselection of descriptors was performed by the Stepwise Selection Procedure denoted as no. 2 in ref. 10 (SSP2).SSP2 selects up to K descriptors (K = 25 in this study) in the '1 by 1' manner (adding one new descriptor to the model in each step).In this procedure the stepwise selection starts from each (j) descriptor (j = 1, 2, ..., N; N = 420 in this study) in the data set, giving N models each with K descriptors.Among them the best one (according to the highest fit correlation coefficient) is chosen.Then, starting from 25 descriptors preselected by SSP2 procedure and their squares and cross-products (350 descriptors altogether), the standard CROMRsel algorithm for selection of the possible subsets of descriptors was used in order to obtain nonlinear models given in Table 8.

Cross-validation.
LOO CV procedure is a procedure usually used for evaluating the model stability.During LOO CV procedure each of N molecules is taken away only once.Using the

Figure 1 .
Figure 1.Scatter plot of the calculated (fit (A) and LOO cross-validated (B)) vs experimental log η for 361 compounds using the linear model from Table6.

Figure 2 .
Figure 2. Scatter plot of the calculated (fit (A) and LOO cross-validated (B)) vs experimental log η for 361 compounds using the nonlinear model from Table9.

Table 1 .
The best two-to eight-descriptor linear multiregression viscosity (log η) models of 361 compounds selected from 420 descriptors by CODESSA (heuristic method from version 2.21) a e.g.MCOD-2 is abbreviation related to the model (M) obtained by CODESSA (COD)

Table 2 .
and 10.Multiregression equation, i.e. regression coefficients, their errors and ttest values, corresponding to the five-descriptor CODESSA model MCOD-5 from Table 1 is given in Table 2.This model is the same as the five-descriptor CODESSA model from ref. 4. The best five-descriptor model (MCOD-5) selected by CODESSA

Table 3 .
The best one-to five-descriptor linear multiregression viscosity models of 361

Table 4 .
The best five-descriptor model (MGFA-5) selected by GFA a, b See footnotes in Table

Table 5 .
The best possible one-to five-descriptor linear multiregression viscosity models of 361 compounds selected by CROMRsel from 420 descriptors a e.g.M-1 is abbreviation related to the model (M) obtained by the CROMRsel selection method containing one (-1) descriptor.b,c, d see footnotes in Table1

Table 6 .
The best linear five-descriptor model (M-5) selected by CROMRsel a, b see footnotes in Table2

Table 7 .
Molecular descriptors involved in the MR viscosity models selected in this study a

Table 8 .
The best possible one-to five-descriptor nonlinear multiregression viscosity models of 361 compounds selected by CROMRsel from 420 descriptors © ARKAT USA, Inc

Table 9 .
The best nonlinear five-descriptor viscosity model (Mnonlin-5) selected by CROMRsel See footnotes in Table2It is evident from Figures1 and 2that the difference between corresponding fitted and LOO CV values is much smaller in the case of the best nonlinear five-descriptor model, indicating model stability.Stability of this five-descriptor model is additionally tested by performing partition of 361 molecules into the set containing molecules 1 -240 from Table 1 in ref. 4 (training set, i.e. set on which model was generated) and the external set containing remaining 121 molecules (241-361) for which viscosities were calculated by the model generated on the training set.Standard errors of estimate of several top models on the training and external sets given in Table

Table 10 .
The stability test of the best selected models by performing an external validation a Abbreviations are related to the models given in Tables1, 3, 5 and 8.

ISSN 1551-7004 Page 57 © ARKAT USA, Inc Experimental Section Data set.
The list of 361 molecules was taken from ref.3.Complete list of molecules (names), experimental viscosity values and viscosities computed using the best linear and non-linear fivedescriptor models from Tables 2, 4, 6 and 9 can be obtained on request.The range of experimental viscosity values of 361 molecules is between 0.164 (trans-2-pentene) and 1490 (glycerol) mPa⋅s (measured at 20 °C).In this study, the logarithmically transformed (log 10 ) viscosity values were used.
) The best 10 two-descriptor models were selected and processed further to the stepwise selection procedure for the development of multidescriptor models.(3)To the multivariate regression models containing n descriptors a new descriptor (having the intercorrelation coefficient with other involved descriptors lower than 0.8) was added to generate a model with n +1 descriptors.