| Version: | 1.2 |
| Name: | (Q)SAR Model Reporting Format |
| Author: | Joint Research Centre, European Commission |
| Date: | July 2007 |
| Contact: | Joint Research Centre, European Commission |
| e-mail: | qsardb@jrc.it |
| www: | http://ecb.jrc.ec.europa.eu/qsar/ |
Nonlinear QSAR: artificial neural network for mouse carcinogenicity
QSARModel 3.3.8
Turu 2, Tartu, 51014, Estonia
http://www.molcode.com
Statistica 7
StatSoft Ltd
http://www.statsoft.com
21.04.2010
Dimitar Dobchev
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Tarmo Tamm
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Gunnar Karelson
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Indrek Tulp
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Dana Martin
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Kaido Tämm
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Deniss Savchenko
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Jaak Jänes
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Eneli Härk
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Andres Kreegipuu
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Mati Karelson
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Molcode model development team
Molcode Ltd.
Turu 2, Tartu, 51014, Estonia
models@molcode.com
http://www.molcode.com
Molcode model development team
Molcode Ltd
Molcode Ltd Turu 2, Tartu, 51014, Estonia
models@molcode.com
www.molcode.com
12.04.2010
Statistica 7
Training, selection and test sets available. Algorithm available.
None to date.
Mouse
4. Human health effects. 4.12. Carcinogenicity
Carcinogenicity was determined using OECD Test Guideline 451 (EU Test Guideline B.32). This method describes the administration of the test substance, normally seven days per week, by an appropriate route, to several groups of experimental animals, one dose per group, for a major portion of their lifespan, and the daily observation of the experimental animals for signs of toxicity, particularly the development of tumours. Chemical carcinogens have been categorized as either genotoxic or non-genotoxic. The former are DNA reactive and the latter act by a variety of other mechanisms. The toxicological property of interest was the carcinogenic potency, expressed as the TD50 value. The TD50 value for a given target site(s), in the absence of tumours in control animals, was taken to be the chronic dose (in mg/kg body wt/day) which induced tumours in half of the test animals at the end of a standard lifespan for the species. The TD50 value used for each compound was the lowest carcinogenic potency value reported for that chemical across all of its positive reports.
TD50 [mmol/kg]
Log (TD50)
Experimental data from different sources have been validated as reliable (ref.5)
A data set of 340 compounds was collected from the database "Chemical carcinogens: structures and experimental data" (ISSCAN), which contains information on chemical compounds tested with the long-term carcinogenicity bioassay on rodents (rat, mouse). Besides being a repository of data, it has been specifically designed as an expert decision support tool. Historically, this database originates from the experience of researchers of the Environment and Primary Prevention Department in the field of structure-activity relationships, aimed at developing models which theoretically predict the carcinogenicity of chemicals. The use of experimental carcinogenicity data for structure-activity relationship studies amplifies their informative value and contributes to the reduction and replacement of animal experimentation. This database contains neither epidemiological data nor regulatory classifications of the carcinogens, but only the experimental results from the carcinogenicity bioassay. The structure of this database is inspired by that of the Distributed Structure-Searchable Toxicity (DSSTox) Network of the US Environmental Protection Agency (EPA) (http://www.epa.gov/nheerl/dsstox/). In the spirit of DSSTox, this project aims to contribute to the free diffusion of scientific data in a standardized, easy to read format [2].
Source of carcinogenicity data: Carcinogenic Potency Database (CPDB) established by Gold and Zeiger (1997) [3,4], TOXNET CCRIS, National Toxicology Program (NTP), International Agency for Research on Cancer (IARC), Survey of Compounds which have been tested for Carcinogenic Activity (SOC), European Inventory of Existing Commercial Chemical Substances (EINECS).
References
1. Gold et al. (1999) (ref. 4, sect. 9.2)
2. DSSTox (ref. 5)
3. Gold L.S. and Zeiger E., 1997 (ref. 6)
4. Carcinogenic Potency Database (CPDB) (ref. 7)
Neural network
Neural network
Nonlinear QSAR: Backpropagation Neural Network (Multilayer Perceptron) regression
The algorithm is based on a neural network predictor with structure 9-9-8-1.
The algorithm is given in the ANN.snn file. To make predictions with it, the user must have Statistica 7 or higher with the ANN modules.
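As an illustration of what a 9-9-8-1 multilayer perceptron computes, the following minimal Python sketch propagates one descriptor vector through two hidden layers (9 and 8 units) to a single output. The weights are random placeholders, not the trained values stored in ANN.snn, and the use of the logistic activation on every layer, including the output, is an assumption. The function names `init_network` and `forward` are hypothetical.

```python
import math
import random

# 9 descriptor inputs, hidden layers of 9 and 8 units, 1 output (structure 9-9-8-1).
LAYER_SIZES = [9, 9, 8, 1]


def logistic(x):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(layer_sizes, seed=0):
    """Build placeholder weights and biases (NOT the trained ANN.snn values)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate one (already scaled) descriptor vector through the network."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            logistic(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]  # single output unit


network = init_network(LAYER_SIZES)
prediction = forward(network, [0.0] * 9)
```

With a logistic output unit the raw network output lies between 0 and 1, so it would have to be mapped back to the log(TD50) scale before use; how Statistica does this for the present model is not stated here.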
Highest e-n attraction (AM1) for N - N bonds,
Highest e-e repulsion (AM1) for N - O bonds,
Lowest e-n attraction (AM1) for N - O bonds,
Max Sigma-Pi bond order (AM1),
Polarity parameter (AM1),
Tot molecular 1-center E-E repulsion (AM1),
DPSA2 Difference in CPSAs (PPSA2-PNSA2) (Zefirov),
Max electrophilic reactivity index (AM1) for C atoms,
Kier shape index (order 2),
Initial pool of ~1000 descriptors. Stepwise descriptor selection was based on a set of statistical selection rules such as the F statistic and the p-value. The 9 descriptors with the highest F (lowest p) were selected from the whole pool (~1000), also taking into account their value distribution. These 9 descriptors were used as inputs to the network. 21 networks with different structures were tested in order to find the ANN with the lowest RMS (root-mean-squared) error. The final network, with the architecture described in 4.2, was then trained for 152 epochs. The weights were optimized with the Levenberg-Marquardt algorithm using a logistic activation function.
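The idea of ranking descriptors by their F statistic can be sketched in Python as follows. This is a simplified stand-in, not the selection procedure implemented in QSARModel: it scores each descriptor by the F statistic of a one-variable linear regression against the target, F = r²(n − 2)/(1 − r²), and keeps the top-ranked ones, without the stepwise logic or the value-distribution check. All names and the toy data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def f_statistic(xs, ys):
    """F statistic of the one-variable regression of ys on xs."""
    n = len(xs)
    r2 = pearson_r(xs, ys) ** 2
    if r2 >= 1.0:
        return math.inf
    return r2 * (n - 2) / (1.0 - r2)


def rank_descriptors(columns, target, keep):
    """Return the names of the `keep` descriptors with the highest F."""
    ranked = sorted(columns, key=lambda name: f_statistic(columns[name], target), reverse=True)
    return ranked[:keep]


# Toy data: three hypothetical descriptor columns and a target vector.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "d_const": [2.0] * 6,
    "d_weak": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "d_strong": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
}
selected = rank_descriptors(columns, target, keep=2)
```

For the model described here, `keep` would be 9 and `columns` would hold the ~1000 calculated descriptors.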
All descriptors were generated using QSARModel on structures optimized by the AM1 semiempirical quantum mechanical model.
QSARModel
http://www.molcode.com
37.7 (377 chemicals / 10 descriptors)
Applicability domain based on training set:
By descriptor value range (between min and max values): The model is suitable for compounds that have the descriptors in the following range augmented with the confidence in 5.2:
| Desc ID (see 4.3) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Min | -441.347 | 0.0000 | -546.667 | 0.000000 | 0.038500 | 286.705 | 19.680 | 0.00000 | 0.00000 |
| Max | 0.000 | 267.1622 | 0.000 | 0.096778 | 4.041800 | 6136.148 | 2534.643 | 0.06020 | 25.55286 |
Presence of functional groups in structures
Range of descriptor values in training set with ±30% confidence
Descriptor values must fall between the minimal and maximal descriptor values of the training set (see 5.1) ±30%.
If the descriptors of a compound fall in the interval [|min|-0.3|min|; |max|+0.3|max|], where min and max are the values in 5.1, then the ANN model is applicable and the prediction is considered reliable.
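The range check can be sketched in Python using the minimum and maximum values listed in 5.1. The interval above is written with absolute values; this sketch assumes the intended reading is that the minimum is lowered by 30% of its magnitude and the maximum is raised by 30% of its magnitude, which also behaves sensibly for the descriptors with negative minima. The function names and the example compound are hypothetical.

```python
# Training-set descriptor ranges from 5.1 (descriptor IDs 1 to 9).
DESCRIPTOR_MIN = [-441.347, 0.0000, -546.667, 0.000000, 0.038500,
                  286.705, 19.680, 0.00000, 0.00000]
DESCRIPTOR_MAX = [0.000, 267.1622, 0.000, 0.096778, 4.041800,
                  6136.148, 2534.643, 0.06020, 25.55286]
TOLERANCE = 0.30  # the +/-30% margin from 5.2


def domain_bounds(lo, hi, tol=TOLERANCE):
    """Widen [lo, hi] by tol times the magnitude of each bound (assumed reading)."""
    return lo - tol * abs(lo), hi + tol * abs(hi)


def out_of_domain(descriptors):
    """Return the 1-based IDs of descriptors outside the widened range."""
    flagged = []
    ranges = zip(descriptors, DESCRIPTOR_MIN, DESCRIPTOR_MAX)
    for desc_id, (value, lo, hi) in enumerate(ranges, start=1):
        lower, upper = domain_bounds(lo, hi)
        if not lower <= value <= upper:
            flagged.append(desc_id)
    return flagged


# Hypothetical compounds: one inside the domain, one with descriptor 6 too large.
inside = [-200.0, 100.0, -300.0, 0.05, 2.0, 3000.0, 1000.0, 0.03, 10.0]
outside = [-200.0, 100.0, -300.0, 0.05, 2.0, 9000.0, 1000.0, 0.03, 10.0]
```

Under this reading, a compound is treated as inside the applicability domain only when `out_of_domain` returns an empty list.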
QSARModel 3.3.8
http://www.molcode.com
See 5.2
Yes
Chemname:Yes
SMILES:No
CAS RN:Yes
InChI:No
MOL file:Yes
Formula:No
All
All
Data points: 260
Standardization and normalization by taking into account the mean and standard deviation
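Standardization by mean and standard deviation is commonly done as a z-score transform, sketched below in Python. Whether Statistica uses the sample or the population standard deviation, and whether the parameters come from the training set alone, is not stated here; the sketch assumes the sample standard deviation and uses hypothetical values.

```python
import statistics


def fit_standardizer(values):
    """Return the mean and sample standard deviation of a variable."""
    return statistics.fmean(values), statistics.stdev(values)


def standardize(values, mean, sd):
    """Apply the z-score transform (value - mean) / sd."""
    return [(v - mean) / sd for v in values]


# Hypothetical log(TD50) values, for illustration only.
log_td50 = [-2.1, -1.3, -0.4, -1.9, -0.8]
mean, sd = fit_standardizer(log_td50)
z_scores = standardize(log_td50, mean, sd)
```

A prediction made on the standardized scale is mapped back with `z * sd + mean`.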
| | Training log(TD50) | Selection log(TD50) | Test log(TD50) |
| Data Mean | -1.334 | -1.487 | -0.961 |
| Data S.D. | 1.411 | 1.410 | 1.355 |
| Error Mean | 0.022 | -0.187 | -0.029 |
| Error S.D. | 0.986 | 1.074 | 1.144 |
| Abs E. Mean | 0.728 | 0.791 | 0.926 |
| S.D. Ratio | 0.699 | 0.762 | 0.844 |
| Correlation | 0.716 | 0.657 | 0.550 |
RMS (Training) = 0.108, RMS (Selection) = 0.119, RMS (Test) = 0.125, See 6.7
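The regression statistics reported for the three sets can be recomputed from observed and predicted values along the lines of the Python sketch below. The reported S.D. Ratio values agree with Error S.D. divided by Data S.D. (for example 0.986 / 1.411 ≈ 0.699), which is the definition used here. The sign convention for the error (predicted minus observed), the use of sample standard deviations, and the toy data are assumptions.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def regression_stats(observed, predicted):
    """Summary statistics in the layout of the table above."""
    errors = [p - o for o, p in zip(observed, predicted)]  # assumed sign convention
    data_sd = statistics.stdev(observed)
    error_sd = statistics.stdev(errors)
    return {
        "data_mean": statistics.fmean(observed),
        "data_sd": data_sd,
        "error_mean": statistics.fmean(errors),
        "error_sd": error_sd,
        "abs_error_mean": statistics.fmean([abs(e) for e in errors]),
        "sd_ratio": error_sd / data_sd,
        "correlation": pearson_r(observed, predicted),
    }


# Toy observed/predicted log(TD50) values, for illustration only.
stats = regression_stats([-2.0, -1.0, 0.0, 1.0], [-1.5, -1.2, 0.3, 0.6])

# Reported (Data S.D., Error S.D., S.D. Ratio) for each set, from the table above.
REPORTED = {
    "training": (1.411, 0.986, 0.699),
    "selection": (1.410, 1.074, 0.762),
    "test": (1.355, 1.144, 0.844),
}
recomputed_ratio = {name: err / data for name, (data, err, _) in REPORTED.items()}
```

An S.D. ratio well below 1 indicates that the model's errors vary less than the data themselves, so a lower value means a better fit.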
Yes
Chemname:Yes
SMILES:No
CAS RN:Yes
InChI:No
MOL file:Yes
Formula:No
All
All
The method used two randomly selected validation sets: selection (40) and test (40)
Randomly selected 40 and 40 data points for selection and test set, respectively
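A random partition of the 340 compounds into 260 training, 40 selection and 40 test compounds can be sketched in Python as follows. The seed, the function name and the use of integer indices as compound identifiers are assumptions; the actual assignment used for the model is the one given in the supporting data sets.

```python
import random


def split_dataset(compound_ids, n_selection=40, n_test=40, seed=0):
    """Shuffle the compounds and cut off the selection and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    selection_set = shuffled[:n_selection]
    test_set = shuffled[n_selection:n_selection + n_test]
    training_set = shuffled[n_selection + n_test:]
    return training_set, selection_set, test_set


training_set, selection_set, test_set = split_dataset(range(340))
```

Because the three lists are slices of one shuffled list, no compound can appear in more than one set.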
See 6.7 and 6.12
The descriptors for the test set are within the limits of applicability, see 6.7 and 6.12
Overall predictions for the selection set and the test set are significant according to the standard deviation ratio (S.D. Ratio) and RMS error, see 6.7 and 6.12. The selection set was used to stop the ANN training and avoid overfitting; because it was not part of the training set, it can also be considered an external set. The test set was used to test the external prediction of the network after training.
The mechanistic picture of the model is complicated due to the mathematical nature of the ANN (artificial neural network). However, it is known that carcinogenicity is strongly related to N-containing compounds. In our case the descriptors Highest e-n attraction (AM1) for N - N bonds, Highest e-e repulsion (AM1) for N - O bonds and Lowest e-n attraction (AM1) for N - O bonds reflect this fact. For instance, the higher the Highest e-n attraction (AM1) for N - N bonds, the larger the Log TD50. In other words, lower attraction between N-N atoms facilitates the donation of electrons or the formation of hydrogen bonds, which will increase the carcinogenicity. Other variations of these interactions are reflected by the N - O bonds. In addition, the charged surface areas of the compounds and their structural characteristics contribute to the phenomenon under investigation.
A posteriori mechanistic interpretation, consistent with published scientific interpretations of experiments.
The model interpretation is consistent with some published results [e.g. Morales et al, 2006].
Supporting information for: Training set(s), Selection set(s), Test set(s), ANN.snn file (includes the ANN model; to make predictions with it, the user must have Statistica 7 or higher with the ANN modules).
The methodology and software (QSARModel) used to create the present model were also applied to obtain the results published in these papers: Katritzky et al. (2006), Karelson et al. (2006)
Katritzky AR, Dobchev DA, Fara C, Hur E, Tämm K, Kurunczi L, Karelson M, Varnek A & Solov'ev VP (2006). Skin Permeation Rate as a Function of Chemical Structure. Journal of Medicinal Chemistry 49, 3305-3314.
Karelson M, Dobchev DA, Kulshyn OV & Katritzky A (2006). Neural Networks Convergence Using Physicochemical Data. Journal of Chemical Information and Modeling 46, 1891-1897.
OECD Test Guideline 451 (EU Test Guideline B.32)
Gold LS, Manley NB, Slone TH & Rohrbach L (1999). Supplement to the Carcinogenic Potency Database (CPDB): results of animal bioassays published in the general literature in 1993–1994 and by the National Toxicology Program in 1995–1996. Environmental Health Perspectives 107 (Suppl. 4), 527–602.
Distributed Structure-Searchable Toxicity (DSSTox) database of the US Environmental Protection Agency (EPA)
Gold LS & Zeiger E (1997). Handbook of Carcinogenic Potency and Genotoxicity Databases. CRC Press, Boca Raton, FL.
Carcinogenic Potency Database (CPDB)
Morales AH, Pérez MAC, Combes RD & Pérez González M (2006). Quantitative structure activity relationship for the computational prediction of nitrocompounds carcinogenicity. Toxicology 220, 51-62.
Training data set: Mouse_carc_20100429_trainingset
Validation data set: Mouse_carc_20100429_testset, Mouse_carc_20100429_selectionse
Other documents
Q17-10-1-225
2010/07/16
Molcode, Nonlinear QSAR, artificial neural network, mouse carcinogenicity, Mouse