QSAR Studies on amino-succinamic acid derivatives sweeteners

This paper presents QSAR studies carried out with the PRECLAV computer program. The database contains 3-amino-succinamic acid derivatives. The outlier molecules included in the calibration set were identified with a specific criterion. After removal of the outliers, N = 121, s = 0.3989, r = 0.8468, F = 128.2, KCV = 0.7700. The virtual molecular fragments that lead to a significant increase of the sweetness power SP are -CN (cyano) and -C6H4-NHCONH- (aryl-substituted urea). The non-conjugated or weakly conjugated virtual fragment -NH2 leads to a significant decrease of the SP value. The sweetness power is favorably influenced by the size of the molecule. For external validation, the calibration set includes 97 molecules (s = 0.4248, r = 0.8301, F = 89.9, KCV = 0.7560) and the validation set includes 24 molecules (s = 0.3836, r = 0.8580, K = 0.7609). The identification of the molecules in the validation set with a high estimated value of SP is accurate enough to have practical value, even though the calibration and validation sets contain 3-amino-succinamic acid derivatives with very different chemical structures.


Introduction
[5][6][7][8][9][10][11][12][13][14][15] These studies focus on QSAR (Quantitative Structure-Activity Relationship) analysis, molecular modeling and conformational analysis. They aim to establish the structural features of aspartyl-dipeptide derivatives that favour the activation of sweet taste receptors and to predict the sweetness potency of such molecules. Recent studies on the molecular basis of sweet taste in the family of sweet dipeptide ligands revealed that some N-arylalkyl and other substituted aspartyl dipeptide derivatives have a sweetness potency 1,000-50,000 times higher than that of sucrose. This property could justify intensive research aimed at obtaining a compound close in quality to the ideal artificial sweetener: very high sweetness potency, pleasant taste, lack of toxicity, water solubility, and thermal and chemical stability.
The following definitions are used throughout this article:
- calibration set: a group of molecules with a known structure and known values of the sweetness power, used in calculating the QSAR; "learning set" and "training set" are used as synonyms in the literature
- validation set: a group of molecules with a known structure and known values of the sweetness power, NOT used in calculating the QSAR; the equation obtained using the calibration set is used for calculating the sweetness power of the validation set; the concordance between the calculated and the experimental values for the validation set is a measure of the accuracy of the algorithm (program) used
- prediction set: a group of molecules with a known structure and unknown values of the sweetness power; this group includes new structures, not synthesized yet; because the values of the sweetness power for these molecules are unknown, the prediction set cannot be used in calculating the QSAR or for validation purposes; "testing set" is found as a synonym in the literature
- descriptor: any molecular characteristic whose value is to be calculated
- predictor: a descriptor present in the highest quality equation, the only equation used for prediction
This article presents the results of QSAR studies done without a prediction set, with a calibration set and a validation set made up of derivatives of 3-amino-succinamic acid.

Methods and formulae
The dependent property was the sweetness power SP = Log(1 + RS), where RS is the sweetness relative to sucrose. The starting point of the computation was the database (136 derivatives of 3-amino-succinamic acid) shown in Table 1, with the values of RS found in the literature (see last column). Where two bibliographic sources indicate different values of RS for the same molecule, the highest value was used. The configuration of the asymmetric carbon atoms marked in the table refers to the configuration of the asymmetric carbon atom in the picture and of the asymmetric carbon atoms of the R2 group.
The molecules were built virtually using the molecular mechanics program PCMODEL [16]. The geometry of the minimum-energy conformer was obtained with the MMX force field, with conformational analysis by the included GMMX algorithm. Afterwards, the geometry was optimized more rigorously with the quantum mechanics program MOPAC [17], using the keyword string "pm3 pulay gnorm=0.01 shift=50 geo-ok mmok camp-king bonds vectors". The computations were done for the neutral molecules, not for the cations, anions or zwitterions, although these ionic species are probably present in aqueous solution. In this way the errors in geometry optimization are smaller and the predictive power of the QSARs is better.
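As an illustration of the definition of SP, the short sketch below converts relative sweetness values into SP and applies the rule of keeping the highest RS when sources disagree. The base-10 logarithm is an assumption (the text writes only "Log"), and the function name and the RS values are hypothetical.

```python
import math

def sweetness_power(rs_values):
    """Return SP = log10(1 + RS) for one molecule.

    rs_values: RS values reported by different sources for the same molecule.
    The highest reported value is used, as described in the text.
    Base 10 is an assumption; the source only writes "Log".
    """
    rs = max(rs_values)
    return math.log10(1.0 + rs)

# Hypothetical molecule with two literature values of RS
print(sweetness_power([180.0, 200.0]))
# A molecule with no sweetness (RS = 0) gets SP = 0
print(sweetness_power([0.0]))
```

Adding 1 to RS keeps SP defined and equal to zero for tasteless compounds.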
For the statistical computations, an improved version of PRECLAV [18,19] (Property Evaluation by Class Variables) was used. The output files created by MOPAC for each analyzed molecule are input files for PRECLAV and contain the values of some descriptors. Using the data from these files, PRECLAV computed most of the descriptors and performed the statistical analysis.
We used only "whole molecule" PRECLAV descriptors [31]. The methods for identifying "significant" descriptors, the quality criteria for the descriptors, the method for grouping the significant descriptors in sets and the quality criteria for the calculated QSARs have been presented in previous work [18,31]. When there is no validation/prediction set, the "significant" descriptors are those that are sufficiently correlated with the dependent property ($r^2 > 4/N$, where N is the number of molecules in the calibration set). The computed QSARs are multilinear.
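The significance threshold can be illustrated with a minimal sketch. For N = 136 the limit 4/N is about 0.029, and for N = 121 it is about 0.033. The function name and the array layout (one row per molecule, one column per descriptor) are assumptions made for the example; this is not the PRECLAV implementation.

```python
import numpy as np

def significant_descriptors(X, y):
    """Return the column indices of X whose squared Pearson
    correlation with y exceeds 4/N (N = number of molecules)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.std(col) == 0:
            # A constant descriptor has no defined correlation; skip it
            continue
        r = np.corrcoef(col, y)[0, 1]
        if r * r > 4.0 / n:
            keep.append(j)
    return keep
```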
The weighting factors $c_k$ of the predictors $p_k$ are computed by the ordinary least squares method.
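A multilinear fit of this kind can be sketched as follows. The model form SP ≈ c0 + Σ c_k·p_k, with an intercept c0, is an assumption made for the example, as are the function names; the sketch uses NumPy's least-squares solver and is not the PRECLAV code.

```python
import numpy as np

def fit_multilinear(P, sp):
    """Ordinary least squares fit of sp ~ c0 + sum_k c_k * p_k.

    P:  array of shape (N, p), one row per molecule, one column per predictor.
    sp: array of N observed sweetness power values.
    Returns [c0, c1, ..., cp]. The intercept c0 is an assumption.
    """
    P = np.asarray(P, dtype=float)
    sp = np.asarray(sp, dtype=float)
    A = np.column_stack([np.ones(len(sp)), P])   # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, sp, rcond=None)
    return coef

def predict(coef, P):
    """Calculated SP values for the predictor matrix P."""
    P = np.asarray(P, dtype=float)
    return coef[0] + P @ coef[1:]
```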

Identification of the outlier molecules
The outlier molecules are molecules for which the computed QSAR gives only a poor estimate of the sweetness power, although the estimates for the rest of the molecules in the calibration set are good. The presence of outlier molecules lowers the predictive quality of the whole calibration set and often causes a different set of predictors to be included in the final equation. In order to identify these molecules, PRECLAV uses a specific criterion called COIN (Combined Outlier INdex). The COIN value is the product of two factors:
$COIN = O_{value} \cdot O_{rank}$ (2)
where:
- $O_{value}$ is a usual criterion for identifying outlier molecules, based on comparing the calculated and the experimental values of the dependent property
- s is the standard error of estimation
- N is the number of molecules in the calibration set
- p is the number of predictors
- $O_{rank}$ is an identification criterion for the outlier molecules based on comparing the ranks of the molecules in the set ordered by the calculated or experimental values
If the difference $\Delta_{value}$ between the calculated and the experimental value is large compared with the standard error of estimation s, then $O_{value}$ will be large. A molecule with $O_{value} > 1$ may be considered an "outlier by residue". If the difference $\Delta_{rank}$ between the calculated and the experimental ranks is large compared with the number of degrees of freedom of the equation, then the value of $O_{rank}$ is also large. A molecule with $O_{rank} > 1$ may be considered an "outlier by rank". Some molecules in the calibration set may give very high values of $O_{value}$ but low values of $O_{rank}$, or the other way round. Here only the molecules with COIN > 1 have been considered outliers and have been eliminated from further computations.
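The way the two factors combine can be illustrated with a small sketch that takes O_value and O_rank as given inputs and applies the three thresholds mentioned in the text. The function name, the dictionary keys and the numerical inputs are hypothetical.

```python
def classify_outlier(o_value, o_rank):
    """Combine the two outlier factors into COIN and apply the > 1 thresholds.

    o_value and o_rank are assumed to be already computed for one molecule.
    """
    coin = o_value * o_rank
    return {
        "coin": coin,
        "by_residue": o_value > 1,   # "outlier by residue"
        "by_rank": o_rank > 1,       # "outlier by rank"
        "outlier": coin > 1,         # criterion used for elimination
    }

# A large residue factor with a small rank factor: flagged by residue only
print(classify_outlier(2.0, 0.4))
# Both factors above 1: eliminated as an outlier
print(classify_outlier(1.5, 1.2))
```

Because COIN is a product, a molecule that is extreme by only one of the two factors is not necessarily eliminated.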

Obtaining the homogenized calibration set
The homogenized calibration set has been obtained by eliminating the outlier molecules from the original database.

Identifying the significant virtual molecular fragments
PRECLAV divides the analyzed molecules into virtual fragments, using an algorithm presented in previous papers [20,21]. Two heavy atoms joined by a chemical bond of bond order B are included in the same fragment if B > k (the limit value k depends on the method used to compute B). According to the program, the classic conjugated functional groups form a single fragment of higher mass. Any fragment identified by the program is non-conjugated or weakly conjugated with the neighboring fragments. The virtual fragments identified by PRECLAV do not always coincide with the classical functional groups.

For each fragment the program calculates its weight ratio in the analyzed molecule. For a given fragment, the ratios $p_1, p_2, \dots, p_N$ are calculated for the N molecules in the calibration set. The program calculates the linear correlation r between the values of the $p_k$ ratios and the values of the sweetness power. If the value of $r^2$ is greater than a predefined limit, the fragment is considered "significant". The presence of a significant fragment in the molecule strongly influences (positively or negatively) the sweetness power of the molecule. The program calculates a type (1) QSPR equation in which the variables (maximum 10) are the $p_k$ ratios of the significant fragments. The value of the sum in equation (1) is, in this case, the value of the descriptor "QSPR of mass fragment percents".
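The fragment screening step can be sketched as below. The value of the predefined limit is not given in the text, so it is passed in as the hypothetical parameter `r2_limit`; the function name and the data layout are also assumptions.

```python
import numpy as np

def significant_fragments(ratios, sp, r2_limit):
    """Screen virtual fragments by correlation of their weight ratios with SP.

    ratios:   dict mapping a fragment label to its weight ratios p_1..p_N
              (0 for molecules that do not contain the fragment).
    sp:       the N sweetness power values.
    r2_limit: hypothetical stand-in for the "predefined limit" on r^2.
    Returns {label: r} for significant fragments; the sign of r shows
    whether the fragment raises or lowers SP.
    """
    sp = np.asarray(sp, dtype=float)
    found = {}
    for label, p in ratios.items():
        p = np.asarray(p, dtype=float)
        if np.std(p) == 0:
            continue                      # fragment ratio never varies
        r = np.corrcoef(p, sp)[0, 1]
        if r * r > r2_limit:
            found[label] = r
    return found
```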

Identification of the parabolic descriptors
PRECLAV calculates multilinear equations of type (1) for the sweetness power SP. Although it is difficult to believe that the dependence of the sweetness power on various molecular features (descriptors) is linear, these equations have a reasonable predictive value. This is due to the following facts:
- the presence of a certain number of predictors in the equations (the errors induced by each may compensate)
- the intrinsic nonlinear character of some predictors
- acceptable approximations, at least on certain intervals, of some nonlinear functions (exponential, logarithmic, hyperbolic, etc.) by linear functions
If the dependence of the sweetness power on a certain descriptor is parabolic, and the parabola has an extremum (maximum or minimum) between the minimum and maximum values of the descriptor, then the linear function of the descriptor has low predictive ability. In other words, if there is an optimal value of the analyzed descriptor from the point of view of the sweetness power, the linear function is incapable of finding this optimal value. The quality $q_{lin}$ of the linear function and the quality $q_{par}$ of the parabolic function of a given descriptor are calculated in a specific way by PRECLAV [18]. The linear function is replaced with a parabolic function only if this replacement leads to a significant increase in the quality of the descriptor, according to (5):
$q_{par} > 1.5 \cdot q_{lin}$ (5)

ISSN 1424-6376 Page 31 © ARKAT

The relative utility of the predictors
When there is no validation/prediction set, the program calculates the relative utility U of the predictors using equation (6):
$U = (R^2 - r^2) / (1 - r^2)$ (6)
where:
- $R^2$ is the square of the Pearson correlation between the experimental and calculated values of SP (values calculated using an equation with p predictors)
- $r^2$ is the square of the Pearson correlation between the experimental and calculated values of SP (values calculated using an equation with p - 1 predictors, that is, the equation that does not contain the analyzed predictor)
After the value of U has been calculated for all the predictors, the values are normalized to the highest U (the highest value becomes 1000). Predictors with a high value of U (U > 400) may be considered very useful in calculating the sweetness power SP. These predictors are useful because they correlate very well with SP and do not correlate with the other predictors. Each "useful" predictor explains a large part of the SP variation and, at the same time, a part different from that explained by the other predictors.
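Equation (6) and the normalization to 1000 can be sketched as follows. The refit of the equation without each predictor is done here by ordinary least squares with an intercept, which is an assumption, as are the function names. The sketch needs at least two predictors.

```python
import numpy as np

def r_squared(P, sp):
    """Squared Pearson correlation between observed SP and the values
    calculated from an OLS fit on the predictors in P (with intercept)."""
    A = np.column_stack([np.ones(len(sp)), P])
    coef, *_ = np.linalg.lstsq(A, sp, rcond=None)
    return np.corrcoef(A @ coef, sp)[0, 1] ** 2

def relative_utility(P, sp):
    """U = (R^2 - r^2) / (1 - r^2) for each predictor, scaled so that
    the highest value becomes 1000. Requires at least two predictors."""
    P = np.asarray(P, dtype=float)
    sp = np.asarray(sp, dtype=float)
    R2 = r_squared(P, sp)                              # equation with p predictors
    U = []
    for k in range(P.shape[1]):
        r2 = r_squared(np.delete(P, k, axis=1), sp)    # equation without predictor k
        U.append((R2 - r2) / (1.0 - r2))
    U = np.array(U)
    return 1000.0 * U / U.max()
```

A predictor that duplicates information already carried by the others leaves r^2 close to R^2 when it is removed, so its U is small.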

Validation of the computation procedure
To validate the method, we carried out a QSAR study with a validation set and a reduced calibration set.
The validation set was extracted from the homogenized calibration set. For the extraction, the molecules in this set were ordered by the observed value of SP, starting from the lowest values. For equal values, the order was arbitrary. The molecules with rank 3, 8, 13, 18, 23, 28, 33, … in this ordered set constituted the validation set. The remaining molecules form the reduced calibration set. The validation set includes approximately 20% of the molecules in the homogenized calibration set. We can assume that the reduced calibration set obtained in this way is a representative sample of the homogenized calibration set. The quality of the prediction for the validation set was considered a measure of the quality of the computation method.
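The rank-based extraction can be sketched in a few lines. Applied to 121 molecules, ranks 3, 8, …, 118 give 24 validation molecules and leave 97 in the reduced calibration set. The function name and the parameter names `start` and `step` are hypothetical.

```python
def split_by_rank(sp_values, start=3, step=5):
    """Split molecule indices into (calibration, validation).

    Molecules are ordered by increasing SP; those with 1-based rank
    start, start + step, start + 2*step, ... form the validation set.
    Ties keep their input order (the text says the order is arbitrary).
    """
    order = sorted(range(len(sp_values)), key=lambda i: sp_values[i])
    validation = [order[rank - 1] for rank in range(start, len(order) + 1, step)]
    chosen = set(validation)
    calibration = [i for i in order if i not in chosen]
    return calibration, validation

# Placeholder SP values for 121 molecules (not the data of Table 1)
calibration, validation = split_by_rank([0.03 * i for i in range(121)])
print(len(calibration), len(validation))
```

Taking every fifth molecule of the ordered list spreads the validation set evenly over the whole range of SP.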
The external validation procedure presented here must not be confused with the LOO (Leave One Out) internal cross-validation method used by PRECLAV in various stages of the computation.

Identification of the outlier molecules
In order to identify the outlier molecules, an initial QSAR study was carried out using the initial database as the calibration set.

QSAR study #1
Calibration set: 136 molecules (all Table 1 molecules)
Validation set: none
Number of significant descriptors: 251
The type (1) QSAR for prediction:
The lowest correlation with SP is calculated for predictor $p_4$ ($r^2$ = 0.0945). The highest intercorrelation between predictors is calculated for the pair $p_1$, $p_4$ ($r^2$ = 0.3359).
Considering the values of s, $r^2$, K and $K_{CV}$, the estimative quality of SP for the whole molecular set is mediocre.
All practical QSAR studies face the problem of correctly identifying the outlier molecules in the calibration set. The presence of these molecules makes it difficult to obtain results with practical value, namely:
a) correct identification of the molecular characteristics with high influence on the dependent property, which is essential in drug design
b) correct identification, in the prediction set, of the molecules "recommended for synthesis"
The process of identifying and eliminating the outlier molecules precedes any other computation in "drug design" procedures. Consequently, the authors consider that QSAR studies (with or without a validation set) will lead to erroneous results if the calibration/validation sets contain outlier molecules. Studies #1 and #2 were performed in order to identify these outlier molecules and to highlight the effects of their removal from the calibration set.
In the group of molecules in Table 1, 15 outlier molecules were identified (COIN > 1), that is, 11% of the molecules in the calibration set. The indices of these molecules are marked with asterisks in the first column of Table 1.
The outlier molecules are:
- 10: the only molecule with an aromatic amido group that has R1 = H
- 25: the only molecule with a furyl group, with a very high SP value
- 13, 14, 15, 16, 45, 132, 133: quite plain in structure, but with very small values of SP
- 49, 57, 58, 76, 78, 117: quite plain in structure, but with very high values of SP
None of the molecules containing R1 = X3C-CO was identified as an outlier. Molecule 114, the only diacid in the set, was also not identified as an outlier.

Obtaining the homogenized calibration set
The homogenized calibration set (121 molecules) has been obtained by eliminating the 15 outlier molecules from the initial database.

Identification of the significant virtual molecular fragments
Another QSAR analysis has been performed, using the homogenized calibration set.

QSAR study #2
Calibration set: 121 molecules (Table 1 molecules
The lowest correlation with SP is calculated for predictor $p_2$ ($r^2$ = 0.0383). The highest intercorrelation between predictors is calculated for the pair $p_1$, $p_5$ ($r^2$ = 0.3505). Considering the values of s, $r^2$, K and $K_{CV}$, the estimative quality of SP for the whole molecular set has significantly improved. The decrease in the scattering of the experimental/calculated values after eliminating the outlier molecules is obvious (Fig. 1 and Fig. 2).

Figure 1. Scatter plot before elimination of outliers
Figure 2. Scatter plot after elimination of outliers
Katritzky obtained [70], for a certain group of peptides (N = 87), a sweetness QSAR with moderate predictive power (p = 5, $r^2$ = 0.6890, F = 35.7). The predictors in the Katritzky QSAR are "total entropy", "maximum partial charge", "number of chlorine atoms", "maximum e-n attraction in C-O bonds" and "minimum net atomic charge of N atoms". The molar volume contribution to the sweet taste is also reviewed.
We must underline that the group of predictors in equation #1 is very similar to the group of predictors in equation #2. This suggests that the outlier molecules belong to the same class as the other molecules. Their labeling as "outliers" may be due not to a difference in the mechanism of biochemical action but, possibly, to an erroneous SP value used in the computations.
In the group of 121 molecules, the program identifies only 33 virtual fragments, so the structural variety is not large. On the other hand, more than 10 significant structural fragments have been identified.
The virtual fragments that lead to a significant increase of the SP value are -CN (cyano) and -C6H4-NHCONH- (aryl-substituted urea). The latter is present in, for instance, molecules 114-127. Because the molecules containing a -CN group have it only in the para position, we cannot evaluate the influence of the position of this group on the aromatic nucleus.
A smaller favorable influence on the SP value is due to the presence of the -C6H4-, -NH- (non-conjugated with neighboring groups) and -C6H4O- fragments.
The virtual fragments that lead to a decrease in the values of SP are -NH2, -NHCO- and -COOH, non-conjugated with the neighboring fragments.
All molecules (with the exception of 114) contain only one -COOH group. Consequently, the unfavorable character of the "high proportion of COOH fragment" predictor must be interpreted as an unfavorable character of the "low molecular mass" predictor. The same arguments apply to the presence of the -NHCO- fragment. The favorable character of high molecular mass is also underlined by the presence in equations #1 and #2 of the highly "useful" predictor "Platt topological index".
A smaller unfavorable influence is also shown by the fragments -Cl and >CH- (branched alkyl).

Identification of the parabolic descriptors
Equation #1 does not contain any parabolic descriptors. In QSAR study #2 only two parabolic descriptors are found in the list of the 30 descriptors with the highest significance. In the final equation only descriptor $p_2$ is a parabolic function, and its calculated utility is rather low. We can conclude that the parabolic functions have lost the mathematical competition with the linear functions. The latter are sufficient to describe the sweetness power of the molecules in Table 1, at least when there is no validation/prediction set.
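The competition between the linear and the parabolic function of a descriptor, decided by criterion (5), can be illustrated with a short sketch. PRECLAV computes the qualities q_lin and q_par in its own specific way; here the coefficient of determination of a polynomial fit is used as a stand-in, which is an assumption, as are the function names.

```python
import numpy as np

def fit_quality(d, sp, degree):
    """Coefficient of determination of a polynomial fit of sp on d.
    Used here only as a stand-in for the PRECLAV quality function."""
    fitted = np.polyval(np.polyfit(d, sp, degree), d)
    ss_res = np.sum((sp - fitted) ** 2)
    ss_tot = np.sum((sp - np.mean(sp)) ** 2)
    return 1.0 - ss_res / ss_tot

def prefer_parabola(d, sp, factor=1.5):
    """True if the parabolic function wins: q_par > factor * q_lin."""
    d = np.asarray(d, dtype=float)
    sp = np.asarray(sp, dtype=float)
    q_lin = fit_quality(d, sp, 1)
    q_par = fit_quality(d, sp, 2)
    return bool(q_par > factor * q_lin)
```

With an optimum inside the descriptor range the linear fit explains almost nothing and the parabola is preferred; for a nearly linear dependence the factor 1.5 cannot be reached and the linear function is kept.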

Utility of predictors
By its definition [22][23][24], the Platt topological index is simultaneously a measure of the order and dimension of the molecular graph, that is, of the size of the molecule and the degree of chain branching. For equal size (equal number of edges) the calculated value of this topological index is higher for the branched molecular graph (higher number of vertices with rank, i.e. degree, > 2). Of two graphs of equal size, the branched one is necessarily closer to a spherical shape. QSAR study #2 suggests that the weak unfavorable effect of chain branching can be compensated by a higher molecular mass.
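The effect of branching on the index can be checked on a small example. The sketch below uses the common definition of the Platt index as the sum, over all edges of the hydrogen-suppressed graph, of the number of edges adjacent to each edge; whether PRECLAV computes it in exactly this way is an assumption.

```python
def platt_index(edges):
    """Platt index of a graph given as a list of (u, v) edges:
    sum over edges of the number of adjacent edges,
    i.e. deg(u) + deg(v) - 2 for each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(degree[u] + degree[v] - 2 for u, v in edges)

# Two carbon skeletons with the same number of edges (3)
n_butane = [(1, 2), (2, 3), (3, 4)]    # linear chain
isobutane = [(1, 2), (2, 3), (2, 4)]   # one vertex of degree 3
print(platt_index(n_butane), platt_index(isobutane))   # prints: 4 6
```

The branched skeleton gives the higher value at equal size, in line with the statement in the text.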
The presence of predictor $p_2$ (in #1) and $p_3$ (in #2), respectively, suggests that the structure of the virtual fragments has a certain influence on the sweetness power.
The calculated value of the free valence of N atoms is appreciably higher in cyano groups than in amide or amino groups. The presence of predictor $p_3$ (in #1) and $p_4$ (in #2), respectively, confirms the importance of the -CN group in molecules with a high value of SP.

Validation of the computational method
Using the procedure described above, a validation set of 24 molecules was extracted from the homogenized calibration set. These molecules are marked in Table 1 in bold characters. The remaining 97 molecules formed the reduced calibration set. The reduced calibration set and the validation set were used in another QSAR study.

QSAR study #3
Calibration set: 97 molecules (
When working with a validation (or prediction) set, PRECLAV uses a specific criterion for selecting the significant descriptors [18], based on the Class function. The program verifies whether, from the point of view of the values of the analyzed descriptor, the calibration set (here the reduced calibration set) is a representative sample of the whole molecular set (here the reduced calibration set plus the validation set). The selection of descriptors is stricter. The resulting QSAR differs from the one obtained when the validation/prediction set is missing, a feature that differentiates PRECLAV from other programs for QSAR computations.
When there is a validation/prediction set, the characteristics of the equation used in prediction (coefficients, number and type of predictors) and the quality of the prediction for the calibration set are of less importance from the point of view of the program. What matters in this case is the correlation between the estimated and the experimental values of the dependent property for the molecules of the validation/prediction set. For the present discussion, the most significant result is therefore the correlation between the calculated and the experimental values of SP for the molecules in the validation set.
Table 2 lists the calculated values (computed by equation #3) and the experimental values (from Table 1) of the sweetness power for the molecules in the validation set.
The correlation between the experimental and calculated values presented in Table 2 was measured using some common statistical functions. The standard error of estimation (s = 0.3836), the square of the Pearson correlation ($r^2$ = 0.8580) and the Kendall rank correlation (K = 0.7609) have values comparable with those calculated for the molecules in the calibration set in QSAR studies #2 and #3. We can state that the program has calculated, for the molecules in the validation set, values close to the experimental ones and has ordered the molecules in a sequence similar enough to the real one.
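Statistics of this kind can be computed for any list of experimental and calculated values with a few lines of code. In the sketch below the root mean square error stands in for the standard error of estimation (the exact denominator used by the program is not stated), and the Kendall coefficient is the tau-a variant, which ignores ties; both choices are assumptions, as are the function names.

```python
import math

def pearson_r2(obs, calc):
    """Square of the Pearson correlation between two value lists."""
    n = len(obs)
    mo, mc = sum(obs) / n, sum(calc) / n
    soc = sum((o - mo) * (c - mc) for o, c in zip(obs, calc))
    soo = sum((o - mo) ** 2 for o in obs)
    scc = sum((c - mc) ** 2 for c in calc)
    return soc * soc / (soo * scc)

def kendall_tau_a(obs, calc):
    """Kendall rank correlation: (concordant - discordant) pairs
    divided by the total number of pairs."""
    n = len(obs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (obs[i] - obs[j]) * (calc[i] - calc[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def rms_error(obs, calc):
    """Root mean square difference between observed and calculated values."""
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(obs, calc)) / len(obs))
```

The Pearson and error statistics measure how close the calculated values are to the experimental ones, while the Kendall coefficient measures only whether the molecules are ordered in the same sequence.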
From the practical point of view, the labeling "high value of dependent property" and "low value of dependent property", which the program always performs [18] for the molecules in the validation/prediction set, is very important. In Table 2 the calculated values identified by the program as "high" are marked in bold letters, while the values identified as "low" are underlined. Of the five molecules labeled "high value", four are also the molecules with the highest experimental values of SP. Of the five molecules labeled "low value", three are the molecules with the lowest experimental SP values. This qualitatively correct result suggests that, in a QSAR analysis with a prediction set, the program will identify with sufficient accuracy the molecules "suggested for synthesis".

Conclusions
After the "outlier" molecules are eliminated from the calibration set using a specific criterion, PRECLAV is capable of offering information about:
- the virtual molecular fragments that are significant from the point of view of the dependent property
- the molecular features useful in describing the variation of the dependent property
The virtual fragments that lead to a significant increase of the sweetness power of the 3-amino-succinamic acid derivatives analyzed here are -CN (cyano) and -C6H4-NHCONH- (aryl-substituted urea).
The non-conjugated or weakly conjugated virtual fragment -NH2 leads to a significant decrease of the sweetness power.
The sweetness power is favorably influenced by the size of the molecule. Linear functions of the descriptors are sufficient to describe the sweetness power of the 3-amino-succinamic acid derivatives, at least when there is no validation/prediction set.
The tests using a validation set suggest that PRECLAV is capable of identifying with sufficient accuracy the new, as yet unsynthesized, molecules with higher/lower values of the dependent property included in a prediction set.

Table 2. Experimental/calculated values of SP for the molecules in the validation set