QSAR study on murine recombinant isozyme mCAXIII: topological vs structural descriptors

The paper describes the first QSAR study on the murine recombinant isozyme mCAXIII. Inhibition of this isozyme is modeled comparatively using topological descriptors, structural descriptors, and their combinations. The results show that distance-based topological indices yield significantly better models than either the structural descriptors or the combination of topological and structural descriptors. In all three cases the Balaban-type indices play a dominant role.


Introduction
Carbonic anhydrases (CAs, EC 4.2.1.1) occupy a special place among the Zn-metalloenzymes and have been studied extensively by us using topological indices [1-19]. The reason for such extensive study is that CAs are involved in crucial physiological processes. Because of this important role, the inhibition of CAs by carbonic anhydrase inhibitors may be explored for designing drugs useful in the management and prevention of many diseases [20-31]. The isozyme mCAXIII, which shows a cytosolic subcellular localization, is one of the 15 isozymes presently known in humans. It is also a sulfonamide-inhibitable isozyme, but it is available only in quite limited amounts, which is why only a small number of sulfonamides have been tested against it. Consequently, few QSAR studies related to mCAXIII are available in the literature. It is nevertheless interesting to note that mCAXIII plays an important role in pH regulation in the reproductive tract of both females and males. This potential of mCAXIII has been recognized only recently [30,31].
Recently Supuran reported [31] the inhibition of mCAXIII with aromatic and heterocyclic sulfonamides. Of the 32 inhibitors used by Supuran, only 14 could be tested against mCAXIII. Furthermore, no QSAR study on this isoenzyme employing topological indices has been reported in the literature. This is the main reason for carrying out the first QSAR study on mCAXIII inhibitors. In the present study, therefore, we have used the mCAXIII data for 14 sulfonamides [30,31]; the structural details of these sulfonamides are given in Table 1. The topological indices, including Balaban and Balaban-type indices, were calculated using DRAGON software [32] and are presented in Tables 2 and 3. In addition, we calculated structural descriptors using ACD Labs software [33]; these are recorded in Table 4. This set of descriptors is also used for modeling the inhibitory activity against mCAXIII. Statistically significant models were then obtained by stepwise regression analysis adopting the maximum-$R^2$ method [34]. The results are discussed below.

Descriptor abbreviations: $W$, Wiener index [44]; ${}^0\chi$, ${}^1\chi$, ${}^2\chi$, zero-, first-, and second-order Randic connectivity indices [45,46]; ${}^0\chi^v$, ${}^1\chi^v$, ${}^2\chi^v$, zero-, first-, and second-order Kier and Hall valence connectivity indices [47,48].

Results and Discussion
Since the objective of our investigation is to work out the relative correlation potential of topological indices on the one hand and structural descriptors on the other for modeling the inhibitory activity, $\log K_i$(mCAXIII), we have carried out the QSAR study under the following three headings: (1) modeling of $\log K_i$(mCAXIII) using distance-based topological indices, including Balaban and Balaban-type indices; (2) modeling of $\log K_i$(mCAXIII) using structural descriptors; and (3) modeling of $\log K_i$(mCAXIII) based on combinations of topological and structural descriptors.
We now discuss these three types of QSAR studies.

(1) Topological modeling of $\log K_i$(mCAXIII)
The preliminary regression analysis indicated that the inhibitory activity against mCAXIII, i.e. $\log K_i$(mCAXIII), can be modeled successfully even in mono-parametric regression using $J_{hetv}$ as the correlating parameter. The other topological indices, taken singly, are incapable of modeling this activity. The mono-parametric model is:

$$\log K_i(\text{mCAXIII}) = -0.744 + 0.773\,(\pm 0.151)\,J_{hetv} \tag{1}$$

$N = 14$, $Se = 0.283$, $R = 0.829$, $F = 26.353$, $Q = 2.930$

The positive coefficient of $J_{hetv}$ indicates that an increase in the van der Waals weighted distance is favorable for the exhibition of the activity.
Stepwise regression indicated that adding the Wiener index, $W$, to the above model [eq. (1)] yields a model with dramatically improved statistics. The resulting bi-parametric model is:

$$\log K_i(\text{mCAXIII}) = 1.116 + 0.228\,(\pm 0.107)\,J_{hetv} - 9.331 \times 10^{-4}\,(\pm 1.390 \times 10^{-4})\,W \tag{2}$$

$N = 14$, $Se = 0.131$, $R = 0.969$, $R^2_A = 0.927$, $F = 84.111$, $Q = 7.397$

This means that the two-variable regression yields an excellent model. The negative sign of the $W$ coefficient is probably due to high collinearity between $J_{hetv}$ and $W$. Such problems of collinearity and how to deal with them are discussed separately in a later section. However, the occurrence of $W$ in the above model does indicate that size, shape, and branching have a significant effect on the exhibition of $\log K_i$(mCAXIII).
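The stepwise procedure used here can be pictured as a greedy forward selection that, at each step, adds the descriptor giving the largest $R^2$. The following is a minimal sketch of that idea only, not the software used in the study: the descriptor matrix is synthetic, the descriptor names `d1` to `d4` are placeholders, and the helper functions `r_squared` and `forward_select` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, names, max_terms):
    """Greedy forward selection: at each step add the descriptor that maximizes R^2."""
    chosen, history = [], []
    while len(chosen) < max_terms:
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(candidates, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
        history.append(([names[j] for j in chosen], r_squared(X[:, chosen], y)))
    return history

# Synthetic stand-in data (NOT the values of Tables 2-4): 14 "compounds", 4 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(14, 4))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=14)
names = ["d1", "d2", "d3", "d4"]

history = forward_select(X, y, names, max_terms=3)
for terms, r2 in history:
    print(terms, round(r2, 3))
```

Because each model is nested in the next, $R^2$ can only stay equal or rise as descriptors are added, which is why the adjusted $R^2$ and cross-validated statistics discussed later are needed to judge whether an added term is worthwhile.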
Considering the sample size (14 compounds) and following the rule of thumb [35,36], we can go at most to tri-parametric regression analysis. In doing so, we observed that adding ${}^1\chi^v$ (the first-order valence connectivity index) to the above model [eq. (2)] gave the following tri-parametric model:

$$\log K_i(\text{mCAXIII}) = 0.645 + 0.259\,(\pm 0.093)\,J_{hetv} - 1.391 \times 10^{-3}\,(\pm 2.370 \times 10^{-4})\,W + 0.112\,(\pm 0.068)\,{}^1\chi^v \tag{3}$$

$N = 14$, $Se = 0.112$, $R = 0.979$, $R^2_A = 0.947$, $F = 79.142$, $Q = 8.741$

The physical significance of the $J_{hetv}$ and $W$ terms in eq. (3) is the same as discussed above for eq. (2). The positive coefficient of ${}^1\chi^v$ in eq. (3) indicates that the presence of heteroatoms and first-order branching is favorable for the exhibition of $\log K_i$(mCAXIII). Strictly, the sample size (14 compounds) does not permit regressions with more parameters. However, when we attempted one, the following tetra-parametric model resulted from the addition of $J_{hetp}$:

$$\log K_i(\text{mCAXIII}) = 0.013 + 1.341\,(\pm 0.446)\,J_{hetv} - 1.336 \times 10^{-3}\,(\pm 1.943 \times 10^{-4})\,W + 0.181\,(\pm 0.057)\,{}^1\chi^v - 0.798\,(\pm 0.324)\,J_{hetp} \tag{4}$$

$N = 14$, $Se = 0.091$, $R = 0.988$, $R^2_A = 0.965$, $F = 89.813$, $Q = 10.858$

This four-parametric model for a set of 14 compounds made us critically examine the rule of thumb [35,36]. The rule argues that multiple regression analysis generally requires significantly more compounds than parameters; a useful rule of thumb is three to six times the number of parameters under consideration. With 14 compounds and four parameters the ratio is 3.5, so the lower limit of the rule of thumb is in favor of this four-parametric model.

Figure 1. Correlation of observed and calculated $\log K_i$(mCAXIII) using eq. (4) (axes: observed vs. estimated $\log K_i$(mCAXIII)).
The physical significance of the $J_{hetv}$, $W$, and ${}^1\chi^v$ terms in eq. (4) is the same as above [eq. (3)]. The negative coefficient of $J_{hetp}$ indicates that the polarizability-weighted distance has a negative effect on the exhibition of $\log K_i$(mCAXIII).
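The rule of thumb invoked above and the adjusted $R^2$ values quoted for eqs. (2) to (4) can be checked with a few lines of arithmetic. The sketch below assumes the standard textbook form $R^2_A = 1 - (1 - R^2)(N - 1)/(N - k - 1)$, which the text does not state explicitly; because it starts from the rounded $R$ values, small differences in the third decimal place are to be expected. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r, n, k):
    """Adjusted R^2 from the multiple correlation coefficient r,
    n compounds, and k descriptors (standard textbook formula, assumed here)."""
    return 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)

N = 14
# (equation number, number of descriptors k, reported R, reported adjusted R^2)
reported = [(2, 2, 0.969, 0.927), (3, 3, 0.979, 0.947), (4, 4, 0.988, 0.965)]

for eq, k, r, r2a in reported:
    ratio = N / k  # compounds per descriptor; the rule of thumb asks for 3 to 6
    print(f"eq. ({eq}): N/k = {ratio:.2f}, "
          f"adjusted R^2 = {adjusted_r2(r, N, k):.3f} (reported {r2a})")
```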
The aforementioned results prompt a comment on the intercept (i.e. the constant term), which gradually approaches its ideal value of zero as we go from mono- to tri-parametric regressions. This means that no systematic error occurred in the calculation of $\log K_i$(mCAXIII). A constant term approaching the ideal value of zero also indicates that a sufficient range of compounds was used in obtaining the model. We now proceed to discuss the modeling of $\log K_i$(mCAXIII) using structural descriptors.
(2) Modeling of $\log K_i$(mCAXIII) using structural descriptors

Preliminary regression analysis using structural descriptors indicated that MW is the most promising parameter for use in multi-parametric regression analysis. Of the several structural parameters used, only MW yields a sufficiently good model:

$$\log K_i(\text{mCAXIII}) = 2.686 - 5.389 \times 10^{-3}\,(\pm 1.509 \times 10^{-3})\,MW \tag{5}$$

$N = 14$, $Se = 0.353$, $R = -0.718$, $F = 12.750$, $Q = -2.034$

The physical significance of this and the following models will be discussed separately in the latter part of this section.
Addition of the d parameter to the above model [eq. (5)] resulted in yet another model with dramatically improved statistics. This model is found as below:

Here also, the constant term decreases as we go from mono- to higher-parametric regressions. This again indicates that no systematic error occurred in modeling the activity and that a sufficient range of compounds was used in obtaining the model.

(3) Modeling of $\log K_i$(mCAXIII) based on the combination of topological and structural descriptors
The present investigation would not be complete unless we also modeled $\log K_i$(mCAXIII) using combinations of topological and structural descriptors. Out of several such attempts, we observed that the following tetra-parametric model yielded the best results:

$$\log K_i(\text{mCAXIII}) = 0.161 + 1.319\,(\pm 0.572)\,J_{hetv} - 0.867\,(\pm 0.429)\,J_{hetp} - 1.402 \times 10^{-3}\,(\pm 3.505 \times 10^{-4})\,W + 0.037\,(\pm 0.021)\,\alpha \tag{9}$$

$N = 14$, $Se = 0.115$, $R = 0.981$, $R^2_A = 0.944$, $F = 56.027$, $Q = 8.530$

A comparison of this model [eq. (9)] with the models represented by eqs. (4) and (8) shows that it is better than the model expressed by eq. (8) but slightly worse than the model expressed by eq. (4). This means that, in comparison to structural descriptors, the topological descriptors are better suited for modeling $\log K_i$(mCAXIII).

Problem of collinearity
To arrive at the final conclusion it is necessary to examine the proposed models for collinearity. The simplest way is to obtain the correlation matrix in each case. Perusal of Table 6 indicates that all the proposed models suffer from massive collinearity. Another way is to apply the Durbin-Watson test [37,38]. For this, the Durbin-Watson statistic D obtained for each model is used together with the lower and upper critical values, dl and du (Table 7), which can be found in a standard statistics book [34]. The results are summarized below. Thus, the Durbin-Watson test fails to give a definite conclusion regarding the presence or absence of multicollinearity in the proposed models. Therefore, we make use of the recommendations made by Randic [39,40] for resolving the problem of collinearity.

Figure 3. Correlation of observed and calculated $\log K_i$(mCAXIII) using eq. (9) (axes: observed vs. estimated $\log K_i$(mCAXIII)).

Randic [39,40] noted that if a descriptor correlates strongly with another descriptor already used in a regression, such a descriptor is discarded in most studies. For example, ${}^1\chi$ and ${}^2\chi$ often correlate strongly, and in many structure-property-activity studies ${}^2\chi$ has been discarded. This is not theoretically justified and, despite being widespread, the practice should be stopped. Although two highly correlated descriptors depict overall the same features of molecular structure, it is important to recognize that even highly interrelated descriptors differ in some other structural traits. The difference between them may be relatively small but nevertheless very important for structure-property regression.
The criteria for inclusion or exclusion of descriptors should not be based on parallelism between descriptors, even if overwhelming, but on whether the part in which two descriptors disagree is or is not relevant for the characterization of the property considered. If the part in which the second descriptor differs from the first, regardless of how small it is, is relevant for the property under consideration, then the descriptor should be included. Randic [39,40] further stated that the selection of descriptors to be used in structure-property-activity studies should not be delegated solely to computers, although statistical criteria will continue to be useful for preliminary screening of descriptors taken from a large pool. Often in an automated selection of descriptors, a descriptor will be discarded because it is highly correlated with another descriptor already selected. What is important, however, is not whether two descriptors parallel one another, i.e. duplicate much of the same structural information, but whether they are complementary in those parts that are important for structure-property-activity correlations. Hence, the residual of the correlation between two descriptors should be examined and kept or discarded depending on how well it improves the correlation based on the descriptors already selected.
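The idea of examining the residual of one descriptor against another can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two descriptors `d1` and `d2` are synthetic stand-ins (not values from Tables 2 to 4), and a simple least-squares regression of `d2` on `d1` is used to extract the part of `d2` that `d1` does not explain.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 14
# Synthetic stand-ins for two highly correlated descriptors (NOT real data).
d1 = rng.normal(size=n)
d2 = 0.95 * d1 + 0.2 * rng.normal(size=n)

# Pairwise correlation matrix, as one would tabulate for a set of descriptors.
corr = np.corrcoef(np.vstack([d1, d2]))

# Regress d2 on d1 and keep the residual: the part of d2 not explained by d1.
A = np.column_stack([np.ones(n), d1])
coef, *_ = np.linalg.lstsq(A, d2, rcond=None)
d2_resid = d2 - A @ coef

print("corr(d1, d2)       =", round(corr[0, 1], 3))
print("corr(d1, d2_resid) =", round(np.corrcoef(d1, d2_resid)[0, 1], 3))
```

By construction the residual is uncorrelated with `d1`, so using it alongside `d1` in a regression isolates whatever additional structural information `d2` carries.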

Predictive power of the model
The predictive power of the model is judged by the predictive correlation coefficient $R^2_{pred}$, obtained by plotting the observed against the calculated $\log K_i$(mCAXIII). We have chosen the models represented by eqs. (4), (8), and (9). No outlier was found in any of these models. For the model based on topological indices only, $R^2_{pred}$ was found to be 0.976, while for the model based on structural descriptors only it was 0.8692. The $R^2_{pred}$ value for the model based on the combination of both types of descriptors was found to be 0.9614. Once again we observe that topological indices exhibit better predictive power than structural descriptors. The predictive power is further confirmed by calculating Pogliani's quality factor ($Q$) [41-43]. The $Q$ values are reported under each of the proposed models and indicate that the predictive power increases as we go from mono- to tetra-parametric models and is highest for the latter.
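The quality factor values quoted under the equations are consistent with the usual definition $Q = R/Se$. The short check below recomputes $Q$ from the reported $R$ and $Se$; it assumes that definition, which the text does not spell out, and the function name `quality_factor` is illustrative. Eqs. (6) to (8) are omitted because their statistics are not available in this text.

```python
# (equation, reported R, reported Se, reported Q), copied from the statistics in the text
reported = [
    (1, 0.829, 0.283, 2.930),
    (2, 0.969, 0.131, 7.397),
    (3, 0.979, 0.112, 8.741),
    (4, 0.988, 0.091, 10.858),
    (5, -0.718, 0.353, -2.034),
    (9, 0.981, 0.115, 8.530),
]

def quality_factor(r, se):
    """Pogliani's quality factor, taken here as Q = R / Se (assumed definition)."""
    return r / se

for eq, r, se, q in reported:
    print(f"eq. ({eq}): Q = {quality_factor(r, se):.3f} (reported {q})")
```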

Model validation
Having discussed the problems of interaction between variables and collinearity, we now turn to model validation. Validation is required to rule out the possibility of a chance correlation. It is normally done by experimental as well as regression methods. In experimental validation the results are analyzed using the model itself. A high correlation coefficient, a low standard deviation, and F values significant at better than the 90% level are enough to validate the model. If the model satisfies all these requirements, it then needs to be further validated using cross-validated parameters.
Estimation of the probable error of the correlation coefficient (PE) is the first requirement for validating the method. This is defined as below:

where r (or R in multiple correlation) is the correlation coefficient and n is the number of compounds under study. It is argued that: (i) if r (or R) < PE, then r (or R) is not significant; (ii) if r (or R) exceeds PE several times, at least three times, a correlation is indicated; and (iii) if r (or R) > 6PE, then the correlation is definitely good.
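The three criteria can be applied mechanically once PE is computed. The defining expression is not reproduced in this text, so the sketch below assumes the common textbook form $PE = 0.6745\,(1 - r^2)/\sqrt{n}$, which may differ from the expression the authors used. The function names and the label "inconclusive" (for values between PE and 3PE, which the criteria do not cover) are illustrative.

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient (textbook form, assumed here)."""
    return 0.6745 * (1.0 - r ** 2) / math.sqrt(n)

def classify(r, n):
    """Apply the three criteria quoted above to |r|."""
    pe = probable_error(r, n)
    a = abs(r)
    if a > 6 * pe:
        return "definitely good"
    if a >= 3 * pe:
        return "correlation indicated"
    if a < pe:
        return "not significant"
    return "inconclusive"  # between PE and 3*PE: not covered by the criteria

# R values reported for eqs. (1), (4) and (9), each with N = 14 compounds.
for eq, r in [(1, 0.829), (4, 0.988), (9, 0.981)]:
    print(f"eq. ({eq}): 6*PE = {6 * probable_error(r, 14):.4f}, {classify(r, 14)}")
```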
The 6PE data presented in Table 9 indicate that all the proposed correlations are good.

In the cross-validation method, validation is carried out on the basis of cross-validated parameters: PRESS (predicted residual sum of squares), SSY (sum of the squares of the response values), $r^2_{cv}$ (overall predictive ability), $S_{PRESS}$ or $S_{cv}$ (uncertainty of prediction), and PSE or $S_{pred}$ (predictive square error). The calculated values of these parameters are shown in Table 9. We observed that in all cases PRESS < SSY, indicating that the models predict better than chance and can be considered statistically significant. Except for models 1 and 6, all models have a PRESS/SSY ratio smaller than 0.4, indicating that they are quite good models. Furthermore, for models 2, 3, 4, and 9 this ratio is much smaller than 0.1, which indicates that all of these are excellent models. This is further confirmed by the values of $r^2_{cv}$, $S_{PRESS}$, and PSE. It is important to mention that PSE is more directly related to the uncertainty of the prediction and is important in those cases in which $S_{PRESS}$ coincides with $Se$.

Finally, we comment on $R^2_A$, the adjusted $R^2$. If a variable is added that does not contribute its fair share, the $R^2_A$ value declines. It is a measure of the percentage of explained variation in the dependent variable that takes into account the relation between the number of compounds and the number of independent variables in the regression model. $R^2_A$ will decrease if the added variable does not reduce the unexplained variation enough to offset the loss of degrees of freedom. A perusal of Table 5 shows that in all the cases discussed above $R^2_A$ increases with the added variables.
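The cross-validated parameters listed above can be obtained by a leave-one-out loop. The sketch below is a minimal illustration, not the procedure used for Table 9: the data are synthetic, SSY is taken as the sum of squared deviations of the response from its mean, and the expressions $r^2_{cv} = 1 - \mathrm{PRESS}/\mathrm{SSY}$, $S_{PRESS} = \sqrt{\mathrm{PRESS}/(n - k - 1)}$, and $PSE = \sqrt{\mathrm{PRESS}/n}$ are commonly used definitions assumed here. The function name `loo_press` is hypothetical.

```python
import numpy as np

def loo_press(X, y):
    """Leave-one-out PRESS for an ordinary least-squares model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # drop compound i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2            # squared prediction error for i
    return press

rng = np.random.default_rng(2)
n, k = 14, 2
X = rng.normal(size=(n, k))                           # synthetic descriptors (NOT the paper's data)
y = 0.5 + X @ np.array([0.7, -0.4]) + rng.normal(scale=0.1, size=n)

press = loo_press(X, y)
ssy = np.sum((y - y.mean()) ** 2)                     # response sum of squares about the mean
r2_cv = 1.0 - press / ssy                             # overall predictive ability
s_press = np.sqrt(press / (n - k - 1))                # uncertainty of prediction
pse = np.sqrt(press / n)                              # predictive square error

print(f"PRESS/SSY = {press / ssy:.3f}, r2_cv = {r2_cv:.3f}, "
      f"S_PRESS = {s_press:.3f}, PSE = {pse:.3f}")
```

A PRESS/SSY ratio below 1 corresponds to a positive $r^2_{cv}$, which is the sense in which a model "predicts better than chance" in the discussion above.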

Conclusions
From the results and discussion above we conclude the following:
(1) The development of the QSAR model on the murine recombinant isoenzyme mCAXIII is rigorous and formally unexceptionable; in particular, the choice of descriptors appears appropriate. The final model is predictive, and the analysis can give valuable hints for understanding this enzyme's inhibition mechanisms.
(2) In spite of the fact that the isoenzyme mCAXIII is available in quite limited amounts, and that the database is therefore rather limited, this first QSAR study based on 14 compounds could be very useful for examining its inhibitory power.
(3) $\log K_i$(mCAXIII) is best modeled by topological indices, in simple as well as multiple regression analysis.
(4) The structural descriptors can also be used successfully for modeling $\log K_i$(mCAXIII). However, the resulting models are inferior to those obtained using topological indices.
(5) The combinations of topological and structural descriptors do not yield models better than those obtained using topological indices alone.
(6) In topological modeling of $\log K_i$(mCAXIII), Balaban-type indices play a dominant role. In combination with $W$ and ${}^1\chi^v$ they yield excellent multi-parametric models for $\log K_i$(mCAXIII).

Table 1. Structural details of carbonic anhydrase inhibitors used for modeling $\log K_i$(mCAXIII)

Table 2. Various topological descriptors and inhibition activity, $\log K_i$(mCAXIII), used in the present study and their values

Table 3. Balaban and Balaban-type indices used in the present study

Table 4. Values of physicochemical parameters calculated for compounds (

Table 5. Regression parameters and quality of correlation

Table 6. Correlation matrix for the best tri- and tetra-parametric models

Table 8. Actual and predicted values of $\log K_i$(mCAXIII) and their residuals