QSAR analysis of some phthalimide analogues based inhibitors of HIV-1 integrase

A quantitative structure-activity relationship (QSAR) study has been performed on phthalimide analogue-based inhibitors of HIV-1 integrase to understand the structural features influencing the affinity of these inhibitors towards the enzyme. The compounds in the selected series were characterized by molecular descriptors calculated using the QSAR software Dragon and the molecular modeling software ChemOffice 2001. QSAR models were derived by stepwise multiple regression analysis employing the method of least squares. The best QSAR model describing the HIV-1 integrase inhibitory activity of phthalimide analogues was selected on the basis of statistical significance and predictive ability, as gauged by a cross-validation procedure and an external test set. The generated QSAR models revealed that increased HIV-1 integrase inhibitory potency of tricyclic phthalimide derivatives could be achieved by increasing the overall lipophilicity of the molecules and by incorporating halogen substituents in the benzylic aromatic ring attached to the phthalimido nitrogen atom. The model also suggests that an increase in molecular flexibility through the incorporation of rotatable bonds is conducive to the inhibitory activity of phthalimide derivatives, whereas an increase in molecular branching appears to be detrimental to the activity.


Introduction
The alarming spread of the acquired immunodeficiency syndrome (AIDS) epidemic has stimulated the discovery of therapeutic agents that inhibit the replication of the causative virus, human immunodeficiency virus (HIV) [1]. An advanced understanding of the viral life cycle has made it possible to define targets at which that cycle can be interrupted. One such target is the viral integrase (IN), which is responsible for the integration of proviral DNA into the host cell DNA. It catalyzes two distinct reactions: terminal cleavage at each 3' end of the proviral DNA, removing a pair of bases, and strand transfer, which results in the joining of each 3' end to 5' phosphates in the target DNA [2]. As these biochemical reactions are essential for the life cycle of the virus, integrase represents an attractive target for the treatment of HIV infections. What makes IN an especially attractive target for drug design is that there are no known mammalian counterparts to this enzyme, so toxicity is expected to be rather low. Furthermore, the IN region of the pol gene is more conserved than either the reverse transcriptase (RT) or protease (PR) coding regions [3]. While a large number of compounds that inhibit integrase have already been identified [4], only a handful have displayed antiviral activity in cell culture. Currently there are no approved drugs that target IN. However, to date, six integrase inhibitors are under preclinical or clinical study in AIDS patients, namely AR-188, S-1360, L-880,812, JTK-303, FZ-41 and MK-0518 [5,6]. As more IN inhibitors enter human drug trials, there is a growing need for the design of novel lead compounds with diverse structural scaffolds and promising pharmacokinetic properties to counteract the difficulties observed with first-generation IN inhibitors [7].
Considering the recent interest in HIV-1 integrase inhibitors, and to support the design and development of such inhibitors, a quantitative structure-activity relationship (QSAR) investigation of the novel series of HIV-1 integrase inhibitory phthalimide analogues reported by Verschueren et al. [8] was carried out. The QSAR study analyzes the physicochemical and structural requirements for these inhibitors to exhibit optimal inhibitory potency against the HIV-1 integrase enzyme, which will in turn help in rationalizing the design of these molecules as integrase inhibitors.

Experimental Section
Dataset: The dataset consists of structurally diverse compounds reported for HIV-1 integrase inhibitory activity. The selected series comprises forty-two tricyclic phthalimide analogues reported by Verschueren et al. [8] (Table 1). The biological activity values of two compounds (41 and 42) in the series are not well defined; hence they were not used in QSAR modeling. Further, compounds 9 and 40 are stereoisomers. Since the applied descriptors are unable to discriminate between stereoisomers, only the more potent of the two (compound 9) was considered for the QSAR study. The HIV-1 integrase inhibitory activity of compounds in the series is reported as pIC50 values, where IC50 refers to the experimentally determined concentration required to inhibit 50% of integrase strand transfer activity. The remaining 39 compounds were randomly divided into two sets, with 31 compounds used as a training set in developing regression models and the remaining 8 as a validation set for the prediction of biological activity.
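The dataset preparation described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes the usual definition pIC50 = -log10(IC50 in mol/L), and the function names, the random seed and the resulting split are hypothetical, since the text does not state how the random division was generated.

```python
import math
import random

def pic50(ic50_molar):
    """Convert an IC50 given in mol/L to pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)

def random_split(compound_ids, n_test, seed=0):
    """Randomly hold out n_test compounds; the seed is an arbitrary assumption."""
    rng = random.Random(seed)
    test = sorted(rng.sample(compound_ids, n_test))
    held_out = set(test)
    train = [c for c in compound_ids if c not in held_out]
    return train, test

# 42 compounds minus 41 and 42 (activity not well defined) and 40 (stereoisomer of 9)
ids = [c for c in range(1, 43) if c not in (40, 41, 42)]
train_ids, test_ids = random_split(ids, n_test=8)
print(len(train_ids), len(test_ids))
```

With 39 usable compounds and 8 held out, the sketch reproduces the 31/8 partition sizes used in the study, though not necessarily the same membership.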

Table 1. Structural modification and HIV-1 integrase inhibition data of tricyclic phthalimide analogues
The molecular structures of the compounds in the selected series were sketched using the ChemDraw Ultra module of the CS ChemOffice 2001 [9] molecular modeling software. The sketched structures were then transferred to the Chem3D module for generation of three-dimensional (3D) structures. The geometries of the generated 3D structures were pre-optimized using the MM2 force field as implemented in the Chem3D module of CS ChemOffice 2001. These molecular geometries were refined using the quantum chemical program package MOPAC 6.0, applying the AM1 parameterization together with the eigenvector-following geometry optimization procedure. A gradient norm of 0.001 kcal/Å was used to calculate electronic, geometric and energetic parameters for the isolated molecules. The optimized geometries of the molecules were used to compute the necessary quantum chemical descriptors available in the MOPAC server of the Chem3D module. Further, the molecular output was also used for the calculation of selected descriptors available in the software DRAGON [10]. The molecular descriptors employed in the present study are summarized in Table 2.
Variable selection for the QSAR modeling was carried out by the stepwise linear regression method using the statistical program SYSTAT (version 10.2) [11]. The program employs a stepwise technique, i.e., only one parameter at a time is added to a model, always in the order of most significant to least significant in terms of F-test values. Statistical parameters were calculated at each step in the process, so that the significance of the added parameter could be verified. The goodness of the correlation was tested by the squared correlation coefficient (R²), the F-test and the standard error of estimate (SEE). The t-test and the level of significance of each coefficient, as well as the confidence limits of the regression coefficients, are also reported. The squared correlation coefficient (or coefficient of multiple determination), R², is a measure of the fit of the regression model. It represents the part of the variation in the observed (experimental) data that is explained by the model; values closer to 1.0 represent a better fit. The F-test reflects the ratio of the variance explained by the model to the variance due to the error in the model (i.e., the variance not explained by the model). High values of the F-test indicate that the model is statistically significant. The standard error is measured by the error mean square, s², which expresses the variation of the residuals, or the variation about the regression line. Thus, the standard error measures the model error. If the model is correct, it is an estimate of the error of the data variance. The t-test measures the statistical significance of the regression coefficients; higher t-test values correspond to relatively more significant regression coefficients.
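The fit statistics described above can be written out explicitly. The following is a minimal sketch rather than the SYSTAT implementation; the function names and the toy numbers are illustrative assumptions. It takes observed and fitted activities from a least-squares model with an intercept and k descriptors, so that the explained sum of squares equals the total minus the residual sum of squares.

```python
import math

def fit_statistics(observed, fitted, n_descriptors):
    """Return (R2, SEE, F) for a least-squares fit with an intercept."""
    n, k = len(observed), n_descriptors
    mean_obs = sum(observed) / n
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))  # residual sum of squares
    sst = sum((o - mean_obs) ** 2 for o in observed)           # total sum of squares
    dof = n - k - 1                                            # residual degrees of freedom
    r2 = 1.0 - sse / sst
    see = math.sqrt(sse / dof)                                 # standard error of estimate
    f_ratio = ((sst - sse) / k) / (sse / dof)                  # explained / unexplained variance
    return r2, see, f_ratio

def r2_from_f(f_ratio, k, dof):
    """Invert F = (R2 / k) / ((1 - R2) / dof) to recover R2."""
    return f_ratio * k / (f_ratio * k + dof)

# Toy data (hypothetical values, not taken from the paper)
r2, see, f_ratio = fit_statistics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 4.9], 1)
print(round(r2, 3), round(see, 3), round(f_ratio, 1))
```

Because F and R² are linked through the degrees of freedom, `r2_from_f` offers a quick consistency check on reported statistics: an F of 26.7 on 4 and 25 degrees of freedom corresponds to an R² of about 0.81.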
To further validate the model, two additional diagnostics were computed for the descriptors: the pairwise correlations and the variance inflation factors (VIF). The VIF values, defined as 1/(1 − R²), where R² is the squared multiple correlation coefficient of one descriptor regressed on the others, were calculated to identify whether excessive multicollinearity existed among the descriptors; a VIF greater than 10 is indicative of multicollinearity. The Z-score method was adopted for the detection of outliers. The Z-score is defined as the absolute difference between the value predicted by the model and the observed activity, divided by the square root of the mean square error of the data set. Any compound showing a Z-score higher than 2.5 during generation of a particular QSAR model was considered an outlier. The model that passed the statistical diagnostics with as few descriptors as possible was chosen. When the stepwise addition of another descriptor did not significantly improve the statistics of a model, the optimum subset of descriptors for the QSAR was considered to have been reached.
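Both diagnostics are simple to compute once the regression results are available. The sketch below is illustrative only; it assumes that the mean square error is the mean of the squared residuals over the data set (the text does not say whether a degrees-of-freedom correction is applied), and the function names and example values are hypothetical.

```python
import math

def vif(r2_j):
    """Variance inflation factor from the R2 of descriptor j regressed on the others."""
    return 1.0 / (1.0 - r2_j)

def z_scores(observed, predicted):
    """|predicted - observed| divided by the root mean square error of the data set."""
    residuals = [abs(p - o) for o, p in zip(observed, predicted)]
    mse = sum(r ** 2 for r in residuals) / len(residuals)  # assumption: divide by n
    rmse = math.sqrt(mse)
    return [r / rmse for r in residuals]

def flag_outliers(observed, predicted, cutoff=2.5):
    """Indices of compounds whose Z-score exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(observed, predicted)) if z > cutoff]

# Hypothetical example: nine well-predicted compounds and one poorly predicted one
obs = [5.0] * 10
pred = [5.1] * 9 + [7.0]
print(flag_outliers(obs, pred))
```

A descriptor whose R² against the other descriptors reaches 0.9 gives a VIF of 10, which is the multicollinearity threshold quoted above.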
Besides deriving quantitative models of statistical significance, an important aspect of QSAR modeling is validating the model, since a good statistical fit does not guarantee predictive ability. In view of this, the internal consistency of the selected models was assessed by cross-validation following a leave-one-out scheme using the in-house program VALSTAT [12]. In this method, one data point is systematically deleted from the dataset, a QSAR model is constructed on the basis of the reduced dataset, and the model is subsequently used to predict the removed data point. The procedure is repeated until a complete set of predicted values is generated. The validation parameters calculated are the cross-validated squared correlation coefficient (Q²), the standard deviation based on the predicted residual sum of squares (SPRESS) and the standard deviation of error of prediction (SDEP). Q² values greater than 0.5 and low SPRESS and SDEP values (< 0.5) can be considered evidence of the high predictive ability of the QSAR models.
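The leave-one-out scheme can be sketched as follows. This is not the VALSTAT program; it is a self-contained illustration with hypothetical function names, using commonly quoted definitions that are assumed here: Q² = 1 - PRESS/SST, SPRESS = sqrt(PRESS/(n - k - 1)) and SDEP = sqrt(PRESS/n), where PRESS is the sum of squared leave-one-out prediction errors and k is the number of descriptors.

```python
import math

def ols_fit(X, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coef, x):
    return coef[0] + sum(b * xi for b, xi in zip(coef[1:], x))

def loo_cv(X, y):
    """Refit the model n times, each time predicting the single left-out compound."""
    n, k = len(y), len(X[0])
    preds = []
    for i in range(n):
        coef = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(coef, X[i]))
    press = sum((yo - yp) ** 2 for yo, yp in zip(y, preds))
    mean_y = sum(y) / n
    sst = sum((yo - mean_y) ** 2 for yo in y)
    return {"Q2": 1.0 - press / sst,
            "S_PRESS": math.sqrt(press / (n - k - 1)),
            "S_DEP": math.sqrt(press / n),
            "predictions": preds}

# Toy single-descriptor data (hypothetical): an exactly linear relation
result = loo_cv([[0.0], [1.0], [2.0], [3.0], [4.0]], [2.0, 5.0, 8.0, 11.0, 14.0])
print(result["Q2"])
```

Solving the normal equations directly keeps the sketch dependency-free; for larger or nearly collinear descriptor sets a numerically more robust least-squares routine would be preferable.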
Finally, the derived QSAR models were used to predict the activity values of the compounds in the test set, and the external validation parameter, predictive r² (r²pred), was calculated to evaluate the predictive capacity of the model. A value of r²pred greater than 0.5 indicates good predictive capacity of the QSAR model.
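The external validation parameter can be illustrated with a short sketch. It assumes the widely used definition r²pred = 1 - Σ(y_test - ŷ_test)² / Σ(y_test - ȳ_train)², where ȳ_train is the mean activity of the training set; the text does not spell out the formula, so this definition, the function name and the example values are assumptions.

```python
def r2_pred(test_observed, test_predicted, train_observed):
    """Predictive r2 for an external test set, referenced to the training-set mean."""
    train_mean = sum(train_observed) / len(train_observed)
    press = sum((o - p) ** 2 for o, p in zip(test_observed, test_predicted))
    ss_ref = sum((o - train_mean) ** 2 for o in test_observed)
    return 1.0 - press / ss_ref

# Hypothetical pIC50 values, for illustration only
train_obs = [5.0, 6.0, 7.0]
test_obs = [5.5, 6.5, 7.5]
print(r2_pred(test_obs, test_obs, train_obs))         # perfect predictions
print(r2_pred(test_obs, [6.0, 6.0, 6.0], train_obs))  # predicting the training mean
```

Under this definition a model that merely predicts the training-set mean for every test compound scores zero, so positive values indicate predictions better than that baseline.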
In the QSAR models, N is the number of data points, R is the correlation coefficient, R² is the squared correlation coefficient, SEE is the standard error of estimate, F is the Fisher ratio between the variances of calculated and observed activities, and P is the probability value. The figures given in parentheses with a ± sign are 95% confidence limits. The statistical quantities Q², SPRESS and SDEP are based on the leave-one-out method and correspond to the cross-validated squared correlation coefficient, the standard deviation based on the predicted residual sum of squares and the standard deviation of error of prediction, respectively. The molecular descriptors used in the selected QSAR models are defined in Table 2, and their values are tabulated in Table 3. Table 4 records the t-values and VIF values of the descriptors in the selected models. Pairwise correlations between the descriptors used in the QSAR models are summarized in Table 5. Tables 6 and 7 give the pIC50 estimates for compounds in the training and test sets obtained using models 1 and 2, along with the experimental data.
The statistical details of the selected QSAR model indicate good statistical quality. The R² value is above 0.8, which suggests that a large share of the total variance in biological activity is accounted for by the model. The low standard error of estimate (< 0.3) indicates the accuracy of the statistical fit. All values of the t-statistic are significant, which confirms the significance of each descriptor. The F-statistic (on 4 and 25 degrees of freedom) for this model is 26.7 (compared to the critical value of 4.62 at the 0.05 level of significance). The calculated F value exceeds the tabulated F value by a large margin, as desired for a meaningful regression, and corresponds to a confidence level above 99% for this model. The correlation matrix given in Table 5 and variance inflation factor values below 5 indicate the absence of multicollinearity in the model. The stability of model 2 as judged by the leave-one-out procedure is fairly good (Q² > 0.6), suggesting that the model will be useful for meaningful predictions. Further support in this regard is obtained from the low values of the cross-validation parameters SPRESS and SDEP. Very good agreement between experimental data and model computation is achieved using Model 1, as shown in Table 3 and Figure 1. Furthermore, the predictive potential of the model is good, as judged by the r²pred value of 0.62. The predicted activity values for the compounds in the test set, along with their corresponding experimental activity values, are recorded in Tables 6 and 7. The predicted pIC50 values of the compounds in the test set are in agreement with the corresponding experimental values (Figure 2).
The best tetra-parametric equation obtained for modeling the HIV-1 integrase inhibitory activity of tricyclic phthalimide analogues comprises the following descriptor terms: MlogP [13], RBF [14], Jhete [15] and nPhX [14]. The molecular descriptor MlogP refers to the Moriguchi logarithm of the octanol/water partition coefficient and is considered a measure of the lipophilicity of a molecule. The positive coefficient of the descriptor in model 2 suggests that an increase in the overall lipophilicity of the molecule will in turn increase the HIV-1 integrase inhibitory activity of phthalimide derivatives. The constitutional descriptor RBF corresponds to the rotatable bond fraction in the molecule. The positive term associated with this descriptor in model 2 indicates that a fractional increase in the rotatable bonds in the molecule is conducive to the HIV-1 integrase inhibitory activity exhibited by phthalimide derivatives. The functional group count descriptor nPhX represents the number of halogen atoms bonded to carbon atoms in the aromatic ring. The descriptor bears a positive coefficient in model 2, which suggests that an increase in the number of halogen substituents in the aromatic ring will lead to a corresponding increase in the HIV-1 integrase inhibitory potency of phthalimide derivatives. The last descriptor included in the model is the topological descriptor Jhete, a Balaban-type index derived from the electronegativity-weighted distance matrix. Jhete carries a negative weight in model 2, which suggests that molecular branching and the presence of highly electronegative atoms in the molecule disfavor HIV-1 integrase inhibition by phthalimide analogues.
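As a reading aid, the signs discussed above imply a four-descriptor linear model of the general form below. The coefficients b_0 to b_4 are placeholders introduced here for illustration and are not the fitted values; only the signs are taken from the discussion.

```latex
\mathrm{pIC}_{50} = b_0 + b_1\,\mathrm{MlogP} + b_2\,\mathrm{RBF} + b_3\,\mathrm{nPhX} - b_4\,\mathrm{Jhete},
\qquad b_1,\, b_2,\, b_3,\, b_4 > 0
```

Read this way, higher MlogP, RBF and nPhX values raise the predicted pIC50, while a higher Jhete value lowers it.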
Interestingly, compound 3 behaves as a statistical outlier in the selected QSAR model. The outlying behavior of the compound may be due to the fact that it incorporates a pyridazinyl moiety in ring A (Table 1) and exhibits low HIV-1 integrase inhibitory potency despite the presence of a benzylic ring with a bromine substituent. From the molecular descriptors incorporated in the QSAR model, one may conjecture that molecular flexibility and hydrophobicity predominantly govern the integrase inhibitory activity of the phthalimides under study. Molecular flexibility increases with the number of flexible bonds in the molecule, and the importance of flexible bonds may stem from their role in orienting pharmacophoric groups in the active site of the enzyme. Hydrophobic substituents in the molecule might influence enzyme-drug affinity through non-specific interactions with a hydrophobic region in the active site of the enzyme. Furthermore, halogen substitution in the phenyl ring appears to play a significant role in molecule-enzyme affinity, a fact reflected in the increased integrase inhibitory potency exhibited by molecules with halogen substituents in Table 1 (e.g., compounds 22 to 32). Additionally, the negative weight associated with the topological descriptor Jhete emphasizes that the presence of bulky groups and electronegative atoms in the molecule disfavors the HIV-1 integrase inhibitory affinity of the title compounds.
In summary, the present study gives rise to QSARs with good statistical significance and predictive capacity for the HIV-1 integrase inhibitory activity of phthalimide derivatives. For the dataset of 39 phthalimide analogues with well-defined HIV-1 integrase inhibitory activity, the inhibitory potency appears to be influenced by structural components and by the overall lipophilicity of the molecule. The interpretation of the generated QSAR revealed that increased HIV-1 integrase inhibitory potency of tricyclic phthalimide derivatives could be achieved by increasing the overall lipophilicity of the molecules and by incorporating halogen substituents in the benzylic aromatic ring attached to the phthalimido nitrogen atom. The model also suggests that an increase in molecular flexibility through the incorporation of rotatable bonds is conducive to the inhibitory activity of phthalimide derivatives, whereas an increase in molecular branching appears to be detrimental to the activity.

Figure 1. Scatter plot of observed versus predicted activity for Model 2 (training set)

Table 2. Classification and description of the calculated molecular descriptors

Table 3. Descriptor values used for the formulation of model 1
© ARKAT USA, Inc.

Table 4. Descriptor t-values and variance inflation factor (VIF) values for QSAR models

Table 5. Correlation matrix for the descriptors used in Model 1

Table 6. Leave-one-out predicted values for HIV-1 integrase inhibition (training set)