Correlation of blood-brain penetration and human serum albumin binding with theoretical descriptors

Quantitative Structure-Activity Relationship (QSAR) models were developed for blood-brain barrier and human serum albumin binding for a dataset of drugs where experimental values of both properties were available. All drugs were represented by chemical descriptors calculated from their constitutional, geometrical and topological structure, and quantum mechanical wave function. The obtained linear (multilinear regression) and nonlinear (artificial neural network) models link the drug structures to their reported properties. Also, based on the characterization of the descriptors we suggest additional criteria for the search of new active compounds. Each multilinear model was tested by leave-one-out and ABC methods. The latter method separates the all data points into three sets and predicts for each of them the property values. The former method is an iterative procedure which in each step it excludes one data point and predict its value based on the model rebuilt for the remaining data points. In addition, the predictive ability neural networks were assessed using the validation sets. All drug structures were investigated by conformational analysis in order to find the lowest energy conformers.


Introduction
Precise pharmacokinetic, metabolic and toxicity information for a chemical compound is necessary for preclinical and clinical trials in drug discovery.To enhance the efficiency and costeffectiveness of the pharmaceutical industry, in recent years, numerous in vitro ADME-Tox QSAR model with good predictive power (R 2 =0.80, q 2 =0.75) for 94 compounds.The analysis of selected descriptors indicated the dependence of HSA affinity on hydrophobic interactions, solubility, size and shape of drug-like compound. 22The preponderant contribution of hydrophobic regions of drugs to HSA binding is supported by QSAR models developed by Colmenarejo et al. 23 and Estrada et al. 24 for the same data set with significant correlation (R 2 =0.82 and R 2 =0.83, respectively).An electrotopological state QSAR model created by Hall et al. 25 performed satisfactorily in predicting HSA binding ability (R 2 =0.77, q 2 =0.70) for a set of 94 compounds demonstrating increased binding affinity due to the presence and electron accessibility of fluorine and chlorine atoms as well as aromatic, aliphatic methylene and hydroxy groups. 25Moreover, Deeb et al. 26 and Votano et al. 27 employed a principal component-artificial neural network (PC-ANN) method using topological descriptors and achieved QSAR models with good correlation (R 2 =0.85 for 56 compounds and R 2 =0.90 for 808 compounds, respectively).A significant linear correlation (R 2 =0.85, n=18) between plasma protein binding and hydrophobicity/lipophilicity (logD) for a series of acidic glycine antagonists 4hydroxyquinolones was determined. 28However, Kratochwil el al. using a structurally diverse set of molecules (n=76) demonstrated that the lipophilicity (logD) is rather poorly correlated (R=0.4  for acids and R=0.3 for bases) correlated to HSA binding. 29

Blood brain barrier
The blood-brain barrier (BBB) is a complex membranous system of brain capillary endothelial cells, pericytes, astrocytes, and nerve endings that plays a central role in maintaining the homeostasis of the central nervous system by blocking the movement of all molecules except those relatively small molecules can cross cell membranes by means of lipid solubility (oxygen, carbon dioxide, ethanol, and steroid hormones) and those that are transported by specific transport systems (electrolytes, nucleosides, sugars and some amino acids).An additional cerebral protection from chemicals circulating in blood (endogenous toxins and various xenobiotics) is provided by the metabolic enzyme and active efflux transport systems (ATP binding cassette transporter super-family of proteins, organic anion transport proteins) in the membrane of the endothelial cells. 30n drug discovery and development, understanding of the mechanisms of penetration of drugs and drug-like compounds through the BBB is crucial to design new potential candidates to act on the central nervous system (CNS) and to avoid undesirable side effects of compounds acting in peripheral tissue.
In recent years, numerous in vivo, in vitro and in silico approaches have been applied to determine the penetration potential of drug candidates.For in vivo measurements imaging (MRI and PET), 31,32 autoradiography, 33 cerebrospinal fluid sampling 34 and microdialysis 35 techniques have been used.In vitro assessments have been carried out on isolated brain capillaries, 36 bovine brain microvessel endothelial cells, 37 Caco-2 cell monolayers, 38 MDR-MDCK cell lines, 39 parallel artificial membrane permeability assay (PAMPA) 40 and immobilized artificial membranes (IAMs). 41he blood-brain distribution is commonly expressed as the ratio of the steady state molar concentration of a compound in the brain and in the blood: BB=C brain /C blood During the last decade, a large number of computational models for the prediction of logBB were developed by several research groups (see Table 1).* Genetic algorithm A less commonly used alternative to describe the brain permeation is the determination of the permeability-surface area coefficient (expressed as logPS), which reflects the free extracellular concentration of a chemical compound. 57Recently, a QSAR model of logPS with significant prediction ability (R 2 = 0.87, s = 0.52) for 30 neutral compounds was developed. 58The comparison of mathematical equations for logPS and logBB revealed that the factors that determine logPS are also those that determine logBB. 59Thus, the analysis of QSAR models indicates that lipophilicity and polar surface area are the most important molecular determinants of BBB permeation.Moreover, BBB permeation of passively diffusing drugs might be optimal when logP values are in the range of 1.5-2.7 60 .Additionally, for the successful design of a CNS drug, neutral or basic compounds (pKa=7.5-10.5)with relatively low molecular weight (<450) and small polar surface area (60-70Å 2 ), low number of hydrogen bond donors (<3) and acceptors (<7), as well as reduced molecular flexibility (number of rotatable bonds: <8) tend to be particularly suitable. 60he goal of the current study is to develop QSAR models for two biological properties based on a common chemical data set of drugs.The different models are interpreted in terms of descriptors involved and statistical parameters.Two approaches were taken in order to achieve this goal, namely, the development of multilinear mathematical equations and creation and training of artificial neural networks.Before QSAR development, all drugs were subjected to conformational search in order to find the optimum low energy structures.

Blood-brain barrier partitioning (logBB)
The partitioning of a compound across the blood-brain barrier is experimentally measured as the ratio of the concentration of the compound in the brain, C brain , to that in the blood, C blood : BB = C brain /C blood .The data set investigated contains 60 compounds (mainly drugs) with their corresponding BB values, directly measured 6 that were converted to the logarithm scale for analysis (see Table2).

Binding Affinity to Human Serum Albumin (logHSA)
Serum protein binding is usually given as the percentage of a drug bound to plasma proteins at clinically achievable concentrations.These values are usually obtained using in vitro measurements at several concentrations, resulting in a calculated mean value.An alternative method, designed for more rapid screening, involves the use of immobilized HSA as the stationary phase in an HPLC (High-Performance Affinity Chromatography) procedure.In the present study, we use albumin binding affinity values assayed by HPLC by Colmenarejo et al. 23 were used.The dataset was optimized to maximize the diversity in both structure and physicochemical properties and it contains binding affinity constants k (HSA) for 85 diverse drugs collected in Table 2.The affinity binding constant k was computed from retention time (t) as follows: k = (t -t o )/t o , where t o is the time for passage of a nonretained material.

Structure optimizations and descriptor calculations
In the current study, QSAR (Quantitative Structure-Activity Relationship) models are presented for log BB and log HSA involving theoretical descriptors, which were calculated solely from molecular (drug) structure using the QSARModel program. 61These descriptors can be classified as: (i) constitutional, (ii) geometrical, (iii) topological, (iv) charge-related, (v) quantum chemical, and (vi) thermodynamic. 62The total number of descriptors for each property ranged between 600 and 900 per compound.A conformational search was also performed using the MacroModel software package. 63For the calculations, MMFF94s -a static variant of Merck Molecular Force Field 94 (MMFF94) was used. 64For energy minimization the Polak-Ribiere Conjugate Gradient (PRCG) method with a gradient 0.05 kcal/Å as a stopping criterion was used.Furthermore, for the conformational search, the Monte Carlo Multiple Minimum (MCMM) method was used where 100 steps per rotatable bond and up to 15000 steps per compound were defined. 65Conformers with the minimum potential energy were utilized as the MOPAC 6 input structures. 66Within quantummechanical, semi-empirical calculations, the AM1 67 parameterization was used with the gradient norm setting of 0.01 kcal/Å.For reasons of comparison and completeness, optimization of the structures without prior conformational search was also performed.However, the models derived from these drug structures did not lead to significant QSAR equations and hence these results are not presented.
QSAR modeling (a) Linear approach (BMLR).The Best Multilinear Regression method 68 (BMLR) was used to find the best correlation models from selected non-collinear descriptors.BMLR selects the best two-parameter regression equations, the best three-parameter regression equations, etc. on the basis of the highest R 2 and F values in the step-wise regression procedure.The result obtained by BMLR is the "best" linear representation of the property in the given descriptors pool.
(b)Nonlinear approach (ANN).An artificial neural network is a biologically inspired computer algorithm designed to treat the data in a manner emulating the learning pattern in the brain.The computer -based network accepts a set of input values, transforms these, and generates an associated set of output values. 69Through an iterative "learning" process, the network refines the information derived from the input values (descriptors) in order to reproduce an associated set of property values (experimental logBB, logHSA).Once a network has been trained to recognize the underlying theme for a given set of input/target pairs, it may be used to predict an output value corresponding to a new group of input values.
In this case, the signal emerging from the output neuron represents the current calculated log BB or log HSA value.When this value is compared to the desired (experimental target) property, a measure of the network error can be calculated.The root-mean squared error (RMS) was used to quantify the effectiveness of the network.The data are repeatedly passed through the network, with the overall error successively decreasing as the network adjusts the weights and biases to reflect the structure -logBB or structure-log HAS relationship.
A backpropagation network was used along with the delta rule for optimization of the weights. 9In Figure 1 a typical structure of a fully connected neural network is shown with neurons that accept inputs I i .All I i form the total input for a neuron, Net, weighed by the connection weights w.Each neuron calculates a single output, Out, using Net transformed by a sigmoidal function.On the basis of the steepest descent method, this technique optimizes the connection weights by proceeding in the direction that most reduces the error of the estimate.In the training process the weights w are iteratively adjusted by an appropriate amount ∆w, which is proportional to the network error with respect to each weight.A neural network is trained, i.e. the neural network learns from a predefined group of compounds known as the training set.One potential problem associated with the use of a network technique is that it may overtrain i.e. derive a relationship which is too specialized.This problem is averted by delegating a portion of the compounds to serve as "a validation set".Because this set of compounds has no direct influence on the actual learning process, it can be used to monitor the predictive capability of the network at regular intervals during a training run.By tracking the validation set error, the optimal set of weights (and biases) to be used as the final predictive model can then be identified.
In order to find the most important descriptors as inputs to the net, a sensitivity analysis was performed on a preselected descriptor space, based on the lowest (RMS).This space was formed after applying the following criteria for reduction of the total descriptor space: i) all descriptors with variance less than 10 -4 were excluded, ii) descriptors which do not correlate with the property more than r 2 =0.2 were excluded and iii) by chemical inspection of certain irrelevant descriptors (for instance, descriptors related to the internal entropy and enthalpy are less likely to be connected to the investigated property).
An important stage of the modeling process is to define the proper architecture of the ANN models.Several different architectures of ANN models were built for each property in this work.In the search for optimal ANN architecture the lowest possible number of neurons was sought, in order to follow the common principle of generality of the ANN prediction.In addition, the RMS for each different architecture was monitored (regarding the hidden units in the hidden layer) so that to select the one with the lowest RMS.The number of layers was chosen to be three-fold based on the basis of common practice for QSAR ANN modeling 70 and by taking into account the number of data points so as to reduce the possibility of overfitting during the training stage.

ABC validation of the MLR models
To validate the multilinear models, the data was sorted in ascending order according to the experimental values, three subsets (A, B, C) were then formed: the 1 st , 4 th , 7 th , etc. data points comprise the first subset (A), the 2 nd , 5 th , 8 th , etc. comprise the second subset (B), and the 3 rd , 6 th , 9 th , etc. comprise the third subset (C).The three training sets were prepared as the combinations of any two subsets (A and B), (A and C), and (B and C), respectively.The tested MLR model was then rebuilt for each of the training sets with the same descriptors but optimized coefficients, and used to predict the property values of the respective C, B and A subsets.The prediction was assessed based on the R 2 between the predicted and experimental property values. 71n addition to the ABC validation, the standard leave-one-out (LOO) cross-validation (R 2 cv ) for all developed models was used.

Results and Discussion
The BMLR algorithm was used to generate several multilinear equations for each property with the number of descriptors between 2 and 7.The final equations were selected by taking into account several factors in order to obtain significant models: (i) the number of descriptors in the models should follow the basic "rule of thumb" i.e. not less than five data points per descriptor and (ii) relevance of the descriptors toward the nature of the phenomenon under investigation and (iii) statistical feasibility of each additional descriptor (in terms of ∆R 2 , ∆F and t-test value).

BMLR model for logBB
The MLR model obtained for logBB is presented as Eq.1 together with its statistical parameters and descriptors.This is the best equation found for the current data which had a significant quality of fit in terms of R 2 = 0.81.The notations along Eq. 1 are as follows: N is the number of data points; k is the number of descriptors; R 2 is the squared correlation coefficient; R 2 cv is the squared cross-validated correlation coefficient (LOO); F is the Fisher's criterion; and s 2 is the squared standard error.Also, the standard errors of the regression coefficients are also indicated in the equation.
The experimental and predicted values of logBB together with their prediction bands at 95% confidence level (denoted as dashed lines) are shown in Figure 2. The numerical values of these data are shown in Table 2.As can be noted from Figure 2, there are two outliers according to these bands i.e.Methotrexate (underestimated) and Indomethacin (overestimated).Equation ( 1) possesses six descriptors (D1-D6) that are listed in Table 3.According to the tstatistics of the descriptors, the most significant of them is D1 -HA dependent HDCA-2 (AM1) (all).It is related to the hydrogen donor ability of the drug and the respective partial charge on the hydrogen surface area.The definition of this descriptor takes into account as many polar hydrogen atoms as there are possible hydrogen bond acceptors (N, O, S, P, F, Cl, Br and I atoms).Increasing the values of this variable leads to decreased logBB values.Alone, this descriptor correlates with the experimental logBB with R 2 = 0.37.Its maximum and minimum values are 0.0984 and 0.00291, respectively.Hydrogen bonding was reported 11 to be one of the most important factors for the blood-brain barrier penetration.Further, the next pair of descriptors D3 and D6 reflects the influence of the presence of N atoms in the drug molecules on logBB.Almost all compounds (except 3 compounds) of the current data set for the log BB include N atoms; hence the related descriptors appear in the QSAR model.The maximum net atomic charge for N atoms (D3) is the second statistically most significant descriptor according to the descriptor t-value.Its value range is within the interval [0.00915, -0.125].As can be seen from this range, the sign of this descriptor value changes depending on drugs compound.The number of amino groups D6 varies from 0 to 3.
The next group of descriptors D2, D4 and D5 (with numerical ranges [0,3], [-11.46,-8.77], [0, 100.6]) is connected to the short-range and hydrophobic interactions of the drug molecules while penetrating the blood brain barrier.It is known that halogenated compounds are more lipophilic and poorly soluble in water, which leads to favorable penetration of the drug through the barrier.This fact is reflected by the positive sign of the coefficient of D2 in Eq. 1. D4 describes the availability/reactivity of electron pairs (as H-bond acceptors); D5 is a measure of the polarity of the hydrophobic part of the solvent accessible surface area, as it is weighted by the partial charges of the hydrogen and sp 3 hybridized carbon atoms.
In addition to the "rule of five", we recommend when searching (screening) for drugs with high logBB to combine the following characteristics based on the above descriptors: i) hydrogendonor ability D1<0.002 ii) D2=2, two halogenated benzene groups (preferably with attached F) connected together with a nonrotatable bridge, iii) D6 >3, three tertiary amino groups located spatially as close as possible and iv) maximum net charge on N atoms, D3 < -0.1.
Regarding the domain of the applicability of Eq. 1, it can be defined based on the ranges of the independent and dependent variables.Since the data set used to build Eq. 1 is diverse, we suggest that for reliable prediction of logBB for novel drugs candidates, the descriptor values of these should be within the range [Di min -f.|Di min |, Di max + f.|Di max |], where the "applicability" coefficient f = 0.3 and i=1-6.The value of the latter coefficient was determined empirically by taking into account the overall predictive ability of Eq. 1 for the current number of datapoints and its physico-chemical reasoning.Note that the domain of applicability for (1) is not the same as statistical applicability defined by the standard deviation at certain confidence level.It involves a wider chemical space composed from the current drug set plus the external drugs set.Furthermore, the predictions of logBB that lead to deviations from the interval [logBB obsd minf.|logBBobsd max -logBB obsd min |, logBB obsd max + f.|logBB obsd max -logBB obsd min |] should be carefully analyzed and if necessary discarded.It is worth mentioning that the above minimum and maximum values for Di and logBB(observed) concern the current data set.
In principle applicability domain of a model is defined only by the descriptors in the equations.However, because of the significant diversity of the BB set we have also included LogBB (obsd) limits in order to be sure for the correctness of external predictions.In our case, the whole applicability domain consists of a cube in 7-dimensional space spanned over six descriptors plus the property.To give a sense of the applicability (domain) bounds, Figure 3 presents the scatter plot between the predicted by Eq. 1 LogBB (open blue circles) and the descriptor values of D1.In Figure 3, it is shown the two-dimensional rectangular which defines the upper and lower bounds according to the limits ([-2.429,2.419; 0.00204, 0.128]) described in the above paragraph with coefficient of applicability 0.3.As can be seen from Figure 3, all predicted LogBB are within the 2-dimensional rectangular of the domain of applicability.However, for external predictions, points outside this rectangular must be analyzed.

BMLR model for logHSA
The second BMLR model concerned a data set of 85 drugs with experimental logHSA values.The best MLR model obtained for logHSA was a 5-descriptor equation as expressed by Eq. 2.
The experimental and predicted logHSA values for this model are collected in Table 2.The graphical presentation of the correlation between these values is shown in Figure 4 together with the prediction bands at the 95% confidence level.The statistical criteria for Eq. 2 indicate good correlation bearing in mind the diversity of the drugs as well as the number of data points.The two statistically most significant descriptors according to t-values in Eq. 2 are D9 and D10.These two molecular features are related to the charged surface area and the total difference between the charged positive and negative charged surface areas suggesting their importance for the HSA binding phenomenon.In addition, D9 (|t D9 |>|t D10 |) is related to the hydrogen atoms which indicates that the hydrogen-bonding capacity of the drugs plays an important role for in HSA binding.In this regard, it is worth mentioning that D8 -Partial surface area of O atoms appears also in Eq.2.The remaining two descriptors D11 and D7 characterize the molecule as a whole.However, the latter descriptor is related to the total weight of the compound and hydrophobicity.According to its sign, it suggests that the logHSA would increase with the increase of D7 for the current set used.
Since the research involving HSA goes in both directions i.e. seeking for drugs with highly binding affinity and compounds with low binding affinity, we therefore could choose to search for active compounds with high logHSA satisfying the following simultaneous constrains of the descriptors: i) D9>13, ii) compounds with more separated positive minus negative charged surface i.e.D10>0.02,iii) number of C atoms D7>22, iv) compounds with lower O surface, D8<0.1.
It is important to note that for both models the constraints defined for search of active compounds must be within the limits of the minimum and maximum descriptor values defined by the applicability coefficient f.For example, for Eq. 2 and D7 is said to be searched for active compounds with D7>22 that implies the search to be carried out within the range of D7 [22, D7 max + f.D7 max ].In addition, it should be taken into account the optimum (for large logHSA, logBB) between the groups of descriptors, which lead to positive and negative contributions in equations 1 and 2.

ABC validation for MLR models of logBB and logHSA
One of the main goals of this study was also to develop as general as possible MLR models in order to cover a larger chemical space of applicability.Therefore, for the sake of generality, we did not use the division of data into training and external test sets.Instead of direct external validation, a type of leave-many-out validation, i.e. the ABC validation (see Methodology) was employed to "mimic" external validation.The efficiency of QSAR models to predict the scales of log BB and logHSA was assessed by the squared correlation coefficients R 2 (Pred) between experimental and predicted data for test sets (A, B or C).The results from the ABC validation are presented in Table 4.The overall assessment of the predictions for both models (1) and ( 2) is significant as can be noted by the closeness between the average R 2 , R 2 (Fit) and R 2 (Pred) values.cv(Fit) -cross-validated squared correlation coefficient of the multilinear equations for the binary sets (leave-one-out approach) R 2 (Pred) -squared correlation coefficient for the regression between the predicted values (from the models of the binary sets) for the test sets and the respective experimental values The prediction ability of Eq. 1 according to leave-one-out cross-validated R 2 cv (LOO) = 0.748 is significantly high.However, a significant difference (∆R 2 = 0.061) between R 2 cv to R 2 may suggest that Eq. 1 is sensitive toward the exclusion of certain drug(s) during the LOO procedure.A possible reason is that the set is quite diverse and thus the exclusion of certain data points influences the prediction by the model.However, the quality of the average predictions increases when the model is examined using the more robust ABC validation (see Table 4).

Neural network model for logBB
Before the neural network treatment was started, the experimental logarithmic values and descriptor values were both normalized to a range 0-0.9 for internal consistency.Then the significant descriptors were selected by reducing the initial descriptor pool as described in the Methodology part.For each property, the available experimental data were divided into training and test sets.To preserve the generality of the network models, all test sets consisted of not more than ten data points.Further, sensitivity analyses were performed on the reduced descriptor space by constructing 1-1-1 neural networks and then the descriptors that produced the lowest RMS error were selected.Several neural network models with different architecture were investigated for each property.The best model found for logBB had an architecture of 4-3-1.The input descriptors of this model are: Number of halogen groups, SQRT (HA dependent HDCA-2 ), Maximum net atomic charge for N atoms and Maximum electrophilic reactivity index (AM1) for H atoms.As can be noted from Table 3, there is, quite similarity between the nature of the descriptors for the BMLR and ANN models.Hence, similar physico-chemical analysis holds for the ANN descriptors.Also, the HA dependent HDCA-2 descriptor appeared as square root function indicating the nonlinear relation with the property.
The statistical parameters in terms of R 2 for this model resulted in R 2 train =0.85 and R 2 val = 0.87 (RMS 0.26 and 0.32, respectively).The graphical presentation of these correlations between the experimental and predicted for both training and validation sets are shown in Figure 5.The numerical values for this ANN model are included in Table 2.As can be noted most descriptors for this ANN model have similar chemical content as in the case of the linear model (C, O and Cl atoms).Also, HASA-2/SQRT(TMSA) is related to hydrogen-bonding acceptor charged surface area.However, the incorporation of nonlinear transformation by ANN gives a significant improvement of the QSAR model.

ANN -logBB
Generally, the applicability domains of both ANN models for logHSA and logBB are difficult to define compared to the BMLR models.However, because of the similar nature of the descriptors the same limits for the BMLR models can be also applied for these ANN models with the respective applicability coefficients.
As was shown in Figure 3, similar domain presentation can be done for both ANN models.However, because of the better predictivity of the ANN models their coefficient of applicability can be extended to 0.4-0.5.

Conclusions
The overlapping chemical data set of 126 drugs allows development of QSAR correlations for two biological properties i.e. blood-brain barrier and human serum albumin binding with their theoretical descriptors.The results obtained indicate that both the multilinear regression and ANN models exhibit reasonable prediction capabilities.While the linear models were developed mainly for the purpose of structure-activity interpretation, the ANN models were primarily developed for predictions and classification.It is worth noting that the drug structures obtained by conformational search resulting in energy minimized conformers significantly improved both the BMLR and ANN models.
The applicability domains were defined for both BMLR and ANN models for searching of active compounds with predetermined property values, Therefore, the models reported should enable prediction and screening for analogous drugs for their blood-brain barrier penetration and human serum albumin binding.

Figure 2 .
Figure 2. Experimental and predicted logBB values based on equation 1.

Table 1 .
Summary of QSAR studies for the prediction of blood brain distribution

Table 2 .
Continued a drugs used for validation set of the ANN, b -all smile strings for the compounds are available in supplementary material SM2

Table 3 .
Descriptors involved in BMLR models (1) and(2) a-descriptor values are available in supplementary material SM1

Table 4 .
Statistical characteristics for ABC validation of the multilinear models 1 and 2.