Topological modeling of antimycobacterial activity of 3-formyl rifamycin SV derivatives

The paper describes topological modeling of the antimycobacterial activity of 3-formyl rifamycin SV derivatives using a large series of molecular (topological) descriptors. For the set of 53 derivatives of 3-formyl rifamycin SV, no one-variable model is possible; however, multiparametric regression yields an excellent model of the activity. The results are discussed using a variety of statistical parameters.


Introduction
Rifamycins are a group of chemically related antibiotics obtained from Streptomyces mediterranei. They belong to a class of antibiotics, called ansamycins, that contain a macrocyclic ring bridged across two non-adjacent (ansa) positions of an aromatic nucleus [1]. The rifamycins and many of their semi-synthetic derivatives have a broad spectrum of antimicrobial activity [1,2]. They are most notably active against gram-positive bacteria and Mycobacterium tuberculosis. However, they are also active against some gram-negative bacteria and many viruses. They form a class of antibiotics with a specific potency as drugs against tuberculosis via inhibition of the DNA-dependent RNA polymerase [3]. Rifamycin SV (which lacks a C-4 constituent) and the glycolic acid linked at C-3 has antibacterial activity. 3-Formyl rifamycin derivatives (Table 1) are one of the classes of ansamycins widely used against infections caused by ordinary bacteria, tuberculosis and leprosy [4]. In the present study the antibacterial potency, log(MIC_RIA/MIC_X), of this class of compounds (Fig. 1, Table 1) against Mycobacterium tuberculosis is subjected to a QSAR analysis using a large set of topological indices (Tables 2 and 3). In these tables, as well as in the text, this activity is shown as logA. The QSAR modeling is then performed by the maximum-R² method using step-wise regression analysis [5][6][7]. The results are discussed below.

Results and Discussion
So far, no QSAR study of rifamycins has employed the molecular descriptors listed in Table 3 to quantify and elucidate potentially relevant chemical reactivity patterns of the drugs. The literature data on the antibacterial potential of 3-formyl rifamycin SV derivatives (Table 3) were used for preparing models with excellent statistics. The correlation of the antibacterial activity with the molecular descriptors used is given in Table 4.
A preliminary regression analysis (Table 5) indicated that none of the molecular descriptors, used singly, is capable of modeling the activity. However, the data presented in Table 5 did show that the variable BIC4 (bond information content, neighborhood symmetry of order 4) is the most promising descriptor for use in multiparametric regression analysis. This means that the multiparametric model(s) will invariably contain BIC4 as one of the correlating parameters.
Before a multivariate analysis is undertaken, it is convenient to normalize the data in order to make the detection of significant correlations easier. Normally, it is sufficient to preprocess the data by auto-scaling and mean-centering the variables. Auto-scaling gives each variable unit variance and hence the same chance to contribute to an estimated model, while mean-centering facilitates interpretation. This can be achieved by computing the correlation matrix. Such a correlation matrix, as stated earlier, is presented in Table 4. An examination of the correlation matrix (Table 4) shows that the molecular descriptors used are linearly correlated with one another. That is, a model containing such descriptors will suffer from the defect due to collinearity, which statistically is not allowed. Such cases will be examined using the recommendations of Randic [8] discussed in the following section. Following the maximum-R² method [5][6][7], and using a large set of 15 descriptors and the entire set of 53 compounds, we obtained several models containing 1 to 10 correlating parameters (Table 5) and observed that the models contain compounds 27, 31, 33, 43 and 44 as outliers. The deletion of these compounds gave better results (Table 6). A perusal of Table 6 shows that statistically better models start from model 25. A detailed analysis of these models (Table 7) indicates that they contain one or more correlating parameters whose coefficients are significantly smaller than their respective standard deviations. Such models are not allowed statistically. The deletion of such parameters from the models yielded the improved models 33-38 presented in Table 8. Hence, our further discussion will be centered on these six models, 33-38. The data presented in Tables 8 and 9 indicate that model 38 is the best model for the antibacterial activity.
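As a rough illustration of the preprocessing described above, the following Python sketch auto-scales a descriptor matrix and derives the correlation matrix from the scaled data. The descriptor values, the function names and the 0.9 threshold used to flag collinear pairs are illustrative assumptions, not values or procedures taken from this study.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # sample standard deviation
    if np.any(std == 0):
        raise ValueError("a constant column cannot be auto-scaled")
    return (X - mean) / std

def correlation_matrix(X):
    """Pearson correlation matrix computed from the auto-scaled data."""
    Z = autoscale(X)
    n = Z.shape[0]
    return Z.T @ Z / (n - 1)

def collinear_pairs(R, threshold=0.9):
    """Index pairs (i, j), i < j, whose |r| exceeds the assumed threshold."""
    k = R.shape[0]
    return [(i, j) for i in range(k) for j in range(i + 1, k)
            if abs(R[i, j]) > threshold]

# Toy data: 5 hypothetical compounds x 3 hypothetical descriptors
X = np.array([
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.9],
    [4.0, 8.1, 0.1],
    [5.0, 9.8, 0.7],
])
R = correlation_matrix(X)
pairs = collinear_pairs(R)  # columns 0 and 1 are nearly proportional
```

In this toy set only the first two columns are flagged; whether a flagged pair should actually be dropped is a separate question, which the text addresses through the recommendations of Randic [8].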
It is interesting to note that model 38 contains 12 correlating parameters. It therefore becomes necessary to examine model 38 by applying the rule of thumb [9,10] and to search for the optimum number of descriptors for the data set of 53 compounds (reduced to 48 after removing five outliers) used in the present study. The limitations and some common pitfalls of multiple regression analysis were pointed out by Tute [9,10]. According to him, there must be a sufficient number of compounds included in the analysis to enable statistical significance to be reached, despite inevitable errors in measurement. A rule of thumb evolved by Tute [9,10] is that the number of compounds should be at least three times the number of parameters under consideration. With 48 compounds the rule of thumb allows up to 16 parameters, so the proposed 12-parametric model is justified. In order to confirm this finding we investigated the optimum number of parameters that could be used for modeling the activity of the 48 compounds. We did this by plotting R² and R²_A (Table 10) against the number of variables on the same graph. In our case both R² and R²_A go on increasing with the number of variables and become almost constant at 12 parameters. This finding is, therefore, consistent with the results obtained by applying the rule of thumb [9,10]. Further confirmation is made by calculating the activities from each of the proposed models and comparing them with the experimental (observed) activities. Such comparisons are given in Table 11 and demonstrated in Figures 2-7. The results are in favor of the 12-parametric model 38. We have also used the data from Figures 2-7 to obtain correlations between the observed and estimated antibacterial activity. This is demonstrated by models 39-44 (Table 12), which finally confirmed that the proposed 12-parametric model 38 is the most appropriate model for the activity.

[Figures 2-7: comparison of observed and estimated log A; axis label "log A (Estimated)".]

It is worth mentioning that the aforementioned results and discussion are enough to establish the goodness of fit, but none establishes the goodness of prediction. The proposed models can be regarded as excellent only when they have both excellent fit and excellent predictive power. The latter is now investigated by using the method of cross-validation [5][6][7]. The various cross-validated parameters estimated for models 33-38 are given in Table 13. All the cross-validated parameters, except S_PRESS, are in favor of the proposed model. From Tables 9 and 13 we observed that Se is the same as S_PRESS, and thus the latter parameter cannot be used in deciding the uncertainty of prediction. In such cases the uncertainty of prediction is judged from yet another cross-validated parameter, viz. PSE. The lowest value of PSE indicates the lowest uncertainty of prediction. The PSE is smallest for the proposed model. Hence, we can conclude that the proposed model 38 has significant fit and predictive power.
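The cross-validated quantities named above can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the procedure used to produce Table 13: PRESS is computed here by explicit leave-one-out refitting, and the formulas S_PRESS = sqrt(PRESS/(n-k-1)) and PSE = sqrt(PRESS/n) are commonly used definitions that are assumed here, since the text does not state them.

```python
import numpy as np

def loo_press(X, y):
    """Leave-one-out predictive residual sum of squares for an OLS model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # add intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2  # error on the left-out compound
    return press

def cross_validation_summary(X, y):
    """PRESS, SSY and derived parameters (assumed definitions, see lead-in)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    press = loo_press(X, y)
    ssy = float(np.sum((y - y.mean()) ** 2))
    return {
        "PRESS": press,
        "SSY": ssy,
        "PRESS/SSY": press / ssy,
        "R2_cv": 1.0 - press / ssy,
        "S_PRESS": np.sqrt(press / (n - k - 1)),
        "PSE": np.sqrt(press / n),
    }

# Synthetic data: 20 hypothetical compounds, 2 hypothetical descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=20)
summary = cross_validation_summary(X, y)
```

A model for which PRESS/SSY is below 1 predicts better than the mean of the observed activities; the smaller the ratio, the better the prediction.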
Further examination of the data presented in Table 13 indicates that in all six models PRESS < SSY, indicating that these models predict better than chance and can thus be considered statistically significant. Furthermore, the ratio PRESS/SSY for model 38 is smaller than 0.4 (0.3378), indicating that it is a reasonable QSAR model. At this stage it is interesting to comment on R²_A, the adjusted R². It is a measure of the percentage of explained variation in the dependent variable that takes into account the relationship between the number of cases and the number of independent variables in the regression model. Whereas R² never decreases when an independent variable is added, R²_A will decrease if the added variable does not reduce the unexplained variation enough to offset the loss of degrees of freedom. If a variable is added that does not contribute its fair share, R²_A will actually decline. A perusal of Table 10 shows that as we pass from the ten-parametric model to the 12-parametric model, R²_A goes on increasing, indicating that in each case the added parameter contributes enough to the proposed model.
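The behavior of R²_A described above, together with the compounds-per-parameter rule of thumb applied earlier to model 38, can be sketched in Python. The standard formula R²_A = 1 - (1 - R²)(n - 1)/(n - k - 1) is assumed; the two R² values below are hypothetical and chosen only to show the penalty for a weak additional variable, which is the opposite of the trend reported in Table 10.

```python
def adjusted_r2(r2, n_compounds, n_parameters):
    """Adjusted R^2 for a model with an intercept and n_parameters descriptors."""
    dof = n_compounds - n_parameters - 1
    if dof <= 0:
        raise ValueError("not enough compounds for this many parameters")
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / dof

def satisfies_rule_of_thumb(n_compounds, n_parameters, min_ratio=3):
    """Tute's rule of thumb: at least min_ratio compounds per parameter."""
    return n_compounds >= min_ratio * n_parameters

# Hypothetical case: an 11th variable raises R^2 only from 0.900 to 0.901
adj_before = adjusted_r2(0.900, 48, 10)
adj_after = adjusted_r2(0.901, 48, 11)  # lower than adj_before

# Data set of this study: 48 compounds, 12 parameters in model 38
model_38_ok = satisfies_rule_of_thumb(48, 12)
```

In the hypothetical case R² rises slightly while R²_A falls, which is the signal that the added variable does not contribute its fair share.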
From the data presented in Table 8 we observed that all six models contain one or more linearly correlated parameters. Thus, statistically, they suffer from the defect due to collinearity. However, this problem was thoroughly investigated by Randic [8], and we have therefore used his recommendations to resolve the problem arising from collinearity. Randic [8] stated that the selection of descriptors to be used in structure-property-activity studies should not be delegated solely to computers, although statistical criteria will continue to be useful for preliminary screening of descriptors taken from a large pool. Often, in an automated selection of descriptors, a descriptor is discarded because it is highly correlated with another descriptor already selected. What is important, however, is not whether descriptors parallel one another, that is, duplicate much of the same structural information, but whether they differ in those parts that are important for the structure-property-activity correlation. If they differ in the domain that is important for the property or activity considered, both descriptors should be retained. If they differ only in parts that are not relevant for the correlation of the property or activity considered, then one of them can be discarded. Therefore, following Randic [8], all six models can be considered statistically significant. In this regard it is worth mentioning that some of the most obvious problems of severe multicollinearity are as follows: (1) incorrect size of the coefficients, (2) a change in the values of the previous coefficients when a new variable is added to the model, (3) a previously significant variable becoming insignificant when a new variable is added to the model, and (4) an increase in the standard error of the estimate when a new variable is added to the model.
In the proposed models (33-38) none of these problems occurs. Furthermore, all the variables occurring in the models have coefficients that are significantly larger than their respective standard deviations. In view of the aforementioned discussion, all these models are considered statistically significant.
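The criterion used above, that each coefficient should be larger than its standard deviation, can be checked with a short Python sketch. The data are synthetic, the function names are hypothetical, and the cut-off ratio of 1.0 is an assumption; the text only states that coefficients smaller than their standard deviations are not allowed.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Ordinary least squares with an intercept; returns coefficients and their standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    s2 = residuals @ residuals / (n - k - 1)  # residual variance
    cov = s2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    return coef, np.sqrt(np.diag(cov))

def weak_parameters(coef, se, min_ratio=1.0):
    """Descriptor indices (intercept excluded) with |coefficient| / SE below min_ratio."""
    ratio = np.abs(coef[1:]) / se[1:]
    return [j for j, r in enumerate(ratio) if r < min_ratio]

# Synthetic data: the activity depends on descriptor 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 0.5 + 3.0 * X[:, 0] + 0.5 * rng.normal(size=40)
coef, se = ols_with_standard_errors(X, y)
weak = weak_parameters(coef, se)  # candidates for deletion from the model
```

Descriptor 0 is never flagged here because its coefficient is many times its standard error; descriptors that are flagged would be candidates for removal before refitting.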
In order to finalize our results it is worth commenting on the degeneracy of the molecular descriptors used in the present study. A perusal of Table 3 shows that low to high degeneracy is present in all the molecular descriptors used. This is due to the fact that these descriptors belong to the first and second generations of descriptors, which in spite of their degeneracy are quite useful in QSPR and QSAR studies [15]. In our case the degeneracy problem has become more acute due to the use of the descriptors LP1 and X5AV. Out of the 53 compounds used in the present study, LP1 has the same value (2.639) for as many as 49 compounds. In the case of X5AV the value does not vary widely; it ranges between 0.022 and 0.025. Furthermore, both these parameters are involved in all six (33-38) statistically significant models. However, we observed that the use of these parameters in the proposed models is well justified by the increase in R²_A upon their addition as correlating parameters. Furthermore, in all six models these parameters have coefficients very much larger than their corresponding standard deviations, and the models are therefore statistically allowed. Nevertheless, it seems beneficial to confirm our results by performing further regressions without LP1, without X5AV, or without both. If the quality of the regression improves in such a study, it will be better to do the modeling without these parameters; otherwise not. When we did so (Tables 14-19), we observed that the models become quite inferior without these parameters. All these results, therefore, justify the use of these two parameters in all the models proposed by us.
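The degeneracy discussed above can be quantified with a short Python sketch that counts how many distinct values a descriptor takes and what share of compounds carry the most frequent value. Only the count of 49 compounds with LP1 = 2.639 out of 53 comes from the text; the four remaining LP1 values and the rounding to three decimals are hypothetical.

```python
from collections import Counter

def degeneracy(values, decimals=3):
    """Distinct-value count and share of the most frequent value of a descriptor."""
    rounded = [round(v, decimals) for v in values]
    counts = Counter(rounded)
    value, count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "most_common_value": value,
        "most_common_share": count / len(rounded),
    }

# 49 compounds share LP1 = 2.639 (from the text); the other four values are invented
lp1 = [2.639] * 49 + [2.640, 2.641, 2.642, 2.643]
lp1_degeneracy = degeneracy(lp1)
```

A share close to 1 marks a descriptor that separates very few compounds, which is why its retention has to be justified by other criteria such as the change in R²_A.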

Conclusions
From the results and discussion above we conclude that the antimycobacterial activity of 3-formyl rifamycin SV derivatives can be modeled using a twelve-parametric model that contains a variety of molecular descriptors, including distance-based and connectivity indices. The results obtained herein will be useful to pharmaceutical as well as medicinal chemists for synthesizing new drugs with still better antibacterial potential.

Experimental Section
(1) Antimycobacterial activity: The antimycobacterial activity, expressed as log(MIC_RIA/MIC_X), for different strains against Mycobacterium tuberculosis was taken from the literature [4]. For brevity, this activity is expressed as logA in all the tables as well as in the text. Further details are available in [4].
(2) Molecular descriptors: All the molecular descriptors used for proposing statistically significant models were calculated using the DRAGON software [11]. The structure optimization was performed using the ACD Labs [12] and HyperChem [13] software.
(3) Statistical analysis: All the statistical analyses were performed using SPSS Software [14].

Figure 1. General structure of the compounds used in this study.

Table 1. Structural details of the compounds used in the present study.

Table 2. Topological descriptors used in this study.

Table 3. Observed activity (logA) and topological descriptors used in this study.

Table 4. Correlation matrix for the activity and the descriptors used in this study.

Table 5. Model summary considering all the 53 compounds.

Table 6. Model summary for the set of 48 compounds after deleting five compounds (26, 31, 33, 43 and 44) as outliers.

Table 9. Regression parameters for the proposed models.

Table 13. Cross-validation parameters for the proposed models.

Table 19. Cross-validation parameters for models 33-38 after deleting LP1 and X5AV.