Program : QDB
Version : 950621
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology,
Uppsala University, Biomedical Centre, Box 590,
SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : quality analysis of 476 PDB entries
Package : stand-alone
* 1 * G.J. Kleywegt (1996). Use of non-crystallographic symmetry in protein structure refinement. Acta Cryst D52, 842-857. [http://www.iucr.ac.uk/journals/acta/tocs/actad/1996/actad5204.html]
9409XX - 0.0 - initial version (code lost in disk crash)
941017 - 0.1 - started reprogramming
941018 - 0.2 - continued reprogramming
941020 - 0.3 - finished reprogramming
941231 - 0.4 - added EXTRA1 records and properties
950102 - 1.0 - added EXTRA2 records and properties; added SCORE;
removed bug (COMPND wasn't read from the database)
950118 - 1.1 - sensitive to environment variable GKLIB
950621 - 1.2 - calculate significance for correlations
QDB is a simple program for analysing a small database containing various statistics and quality-indicator values for protein structures solved by X-ray crystallography.
The present database file is called "quality.lib" and contains data pertaining to 476 proteins, solved at resolutions between 1.5 and 3.5 A. It was generated by G.J. Kleywegt, for structures from the Brookhaven PDB, using PROCHECK, LSQMAN and several 'jiffy' programs.
NOTE: This program is sensitive to the environment variable GKLIB. If set, the name of this directory will be prepended to the default name for the library file needed by this program. For example, in Uppsala, put the following line in your .login or .cshrc file: setenv GKLIB /nfs/public/lib
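The GKLIB rule can be pictured with a minimal Python sketch. This is not QDB's own code; the function name default_library_path is hypothetical, and joining the directory and file name with a path separator is an assumption (the manual only says the directory name is "prepended").

```python
import os

DEFAULT_LIBRARY = "quality.lib"  # default database name quoted in this manual


def default_library_path(environ=None):
    """Return the default library file name, prefixed by GKLIB if it is set."""
    env = os.environ if environ is None else environ
    libdir = env.get("GKLIB", "")
    if not libdir:
        # GKLIB not set: fall back to the bare default name.
        return DEFAULT_LIBRARY
    # Assumption: directory and file name are joined with a path separator.
    return os.path.join(libdir, DEFAULT_LIBRARY)
```

With GKLIB set to /nfs/public/lib, the sketch yields a default inside that directory; without it, the bare name "quality.lib" is offered.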
When you start the program, it prints a header and the current dimensioning, then asks for the name of your quality-database file. Once the database has been read, it prints the scoring table and a list of available commands:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
% 52 gerard onyx 20:17:33 progs/qdb > QDB

*** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB ***

Version  - 950102/1.0
(C) 1993-5 Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
Others   - T.A. Jones, G. Bricogne, Rams, W.A. Hendrickson
Others   - W. Kabsch, CCP4, PROTEIN, E. Dodson, etc. etc.

Started  - Mon Jan 2 21:41:05 1995
User     - gerard
Mode     - interactive
Host     - onyx
ProcID   - 24473
Tty      - /dev/ttyq1

*** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB *** QDB ***

Max nr of proteins in database : ( 600)
Max nr of criteria in database : ( 75)
Max nr of comments per protein : ( 5)
Max nr of selection save sets : ( 10)
Max nr of histogram bins : ( 50)

Initialising properties ...
Initialising database ...
Name of database file ? (quality.lib)
Reading database ...
Nr of lines read : ( 7183)
Nr of proteins : ( 476)
Calculating scores ...

Quality indicators used in scoring:
           BAD | POOR | FAIR | OKAY | GOOD
Name        -2 |  -1  |   0  |  +1  |  +2  Weight
RESOL         2.20   2.00   1.50   1.20      5.00
RFAC          0.25   0.20   0.15   0.10      1.00
...
%DIH10       15.00  12.00   8.00   5.00      1.00
%ANG5        15.00  12.00   8.00   5.00      1.00

QUit
? (list commands)
$ shell_command
! (comment)
SHow [prop]
LIst pdbid

STats prop
COrr prop1 prop2 [plotfile]
SOrt prop [nlist xp1 xp2 xp3]
ALl_corr prop [cut-off]
HIsto prop [bins min max plotfile]

SElect ALl
SElect NOne
SElect ANd VAlid prop
SElect OR VAlid prop
SElect ANd IF prop operator value
SElect OR IF prop operator value
SElect SAve saveset comment
SElect REstore saveset
SElect ?
SElect INvert

CPU total/user/sys : 7.1 7.0 0.1
QDB [476/476] >
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
The prompt ( QDB [476/476] > ) indicates how many out of how many proteins are currently selected. Only selected proteins will be used in the following commands.
There are three types of command:
- general commands (quit, list, etc.)
- selection commands
- analysis commands
Proteins in the database have 59 numeric properties (at present) and some text attributes. The numeric properties are referred to by their name. The SHow command (without parameter, or with * as parameter) lists all properties, their type (Real or Integer), their default value (if unobserved), the range of valid values (a value outside this range is considered unobserved), and a short description:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > sh
 Nr Name   T Default Minimum Maximum Description
  1 YEAR   I      -1      50      99 Year of deposition at PDB
  2 Z      I      -1       1      96 Nr of asymmetric units per unit cell
  3 NATOMS I      -1     100 1000000 Nr of atoms in the PDB file
  4 NHET   I      -1       0 1000000 Nr of HETERO atoms in the PDB file
...
 72 %DIH10 R   -1.00    0.00  100.00 % Residues |delta-CA-CA*-CA-CA(NCS)| > 10
 73 AADANG R   -1.00    0.00  180.00 Average |delta-CA-CA*-CA(NCS)|
 74 %ANG5  R   -1.00    0.00  100.00 % Residues |delta-CA-CA*-CA(NCS)| > 5
 75 SCORE  R    0.00 -999.00  999.00 Grand quality score
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
Proteins are referred to by their four-character PDB identifier. This can be used with the LIst command to retrieve all defined properties of a certain structure:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > li 1guh
List : (1GUH)
Number : ( 311)
Date : (24-FEB-93)
Spcgrp : (C2)
Jrnl : (J.MOL.BIOL.)
Header : (TRANSFERASE(GLUTATHIONE))
Compnd : (????)
Author : (I.SINNING,G.J.KLEYWEGT,T.A.JONES)
Remark : (GROUPED BS (NOTE: SHOULD DIVIDE NR OF ATOMS BY 2 !))
Remark : (I.SINNING,G.J.KLEYWEGT,T.A.JONES)
Remark : (-)
Remark : (-)
Remark : (-)
YEAR          93 Year of deposition at PDB
Z              4 Nr of asymmetric units per unit cell
NATOMS      3646 Nr of atoms in the PDB file
NHET          54 Nr of HETERO atoms in the PDB file
...
%PSI10      0.00 % Residues with |delta-Psi(NCS)| > 10
AADDIH      0.00 Average |delta-CA-CA*-CA-CA(NCS)|
%DIH10      0.00 % Residues |delta-CA-CA*-CA-CA(NCS)| > 10
AADANG      0.00 Average |delta-CA-CA*-CA(NCS)|
%ANG5       0.00 % Residues |delta-CA-CA*-CA(NCS)| > 5
SCORE      36.00 Grand quality score
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
The simplest analysis option is STats. For example, to find the highest and lowest overall G-factor in the database, use:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > st pogfac
Nr of selected proteins : ( 468)
 Nr Name   T Default Minimum Maximum Description
 59 POGFAC R  -99.00  -50.00   50.00 Overall G-factor
Average value : ( -4.524E-01)
St. deviation : ( 5.541E-01)
Minimum value : ( -7.700E+00)
Maximum value : ( 4.000E-01)
Sum of values : ( -2.117E+02)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
Note that only structures for which the G-factor has actually been calculated are included (i.e., 468 out of 476 structures).
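The way unobserved values drop out of the statistics can be sketched in Python as follows. This is only an illustration of the rule described above (a value outside the valid range counts as unobserved); the function names are hypothetical, and dividing by N rather than N-1 in the standard deviation is an assumption, since the manual does not say which QDB uses.

```python
import math


def observed(values, minimum, maximum):
    """Keep only values inside the valid range [minimum, maximum]."""
    return [v for v in values if minimum <= v <= maximum]


def stats(values, minimum, maximum):
    """Summary statistics over the observed values only."""
    obs = observed(values, minimum, maximum)
    n = len(obs)
    if n == 0:
        return None  # nothing observed, nothing to report
    mean = sum(obs) / n
    # Assumption: population standard deviation (divide by n).
    sdev = math.sqrt(sum((v - mean) ** 2 for v in obs) / n)
    return {
        "n": n,
        "average": mean,
        "sdev": sdev,
        "min": min(obs),
        "max": max(obs),
        "sum": sum(obs),
    }
```

For POGFAC (default -99.00, valid range -50.00 to 50.00), an entry still holding the default value falls outside the range and is therefore skipped.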
The COrrelation command calculates the correlation coefficient between two properties (again, only for those of the currently selected ones for which both properties are defined):
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > sh rmsdbb
 Nr Name   T Default Minimum Maximum Description
 35 RMSDBB R   -1.00    0.00  999.00 RMS delta-B bonded atoms (A**2)
QDB [476/476] > sh rmsdbn
 Nr Name   T Default Minimum Maximum Description
 45 RMSDBN R   -1.00    0.00  999.00 RMS delta-B of NCSCAI CAs improved LSQ (
QDB [476/476] > cor rmsdbb rmsdbn
Nr of selected proteins : ( 324)
 Nr Name   T Default Minimum Maximum Description
 35 RMSDBB R   -1.00    0.00  999.00 RMS delta-B bonded atoms (A**2)
Average value : ( 4.244E+00)
St. deviation : ( 4.085E+00)
Minimum value : ( 9.000E-02)
Maximum value : ( 3.219E+01)
Sum of values : ( 1.375E+03)
 Nr Name   T Default Minimum Maximum Description
 45 RMSDBN R   -1.00    0.00  999.00 RMS delta-B of NCSCAI CAs improved LSQ (
Average value : ( 8.212E+00)
St. deviation : ( 5.119E+00)
Minimum value : ( 0.000E+00)
Maximum value : ( 3.560E+01)
Sum of values : ( 2.661E+03)
Nr of values : ( 324)
Corr. coeff. : ( 0.487)
Plot file : (rmsdbb_rmsdbn.plt)
Plot file written
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
In this case, the correlation between "RMS delta-B bonded atoms" and "RMS delta-B of the NCSCAI CA atoms" is investigated: apparently, people who use strong/weak/no restraints for Bs of bonded atoms also use strong/weak/no restraints for the temperature factors of NCS-related atoms.
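For readers who want to reproduce such a number outside QDB, the calculation can be sketched in Python as below. It assumes the usual Pearson (linear) correlation coefficient, which is what the "Numerical Recipes" chapter cited for version 1.2 covers; the function name and argument layout are hypothetical. Only proteins for which both values lie inside their valid ranges are used.

```python
import math


def correlation(xs, ys, x_range, y_range):
    """Pearson r over the entries for which both properties are observed.

    Returns (number_of_pairs, r); r is None when it cannot be computed.
    """
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
    ]
    n = len(pairs)
    if n < 2:
        return n, None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return n, None  # one property is constant: r is undefined
    return n, sxy / math.sqrt(sxx * syy)
```

This is why the example reports 324 values although 476 proteins are selected: an entry with either property unobserved is left out of the pair list.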
The command also produced an O2D plot file. If you enter "none" as the filename, no such file will be created. If you don't provide a filename at all, a sensible default is generated. Note that this produces a scatter plot (i.e., use the SC command in O2D, or add "sc" if you use the OMAC/o2dps script).
The plot file may look as follows:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
REMARK Created by QDB V. 941020/0.3 at Thu Oct 20 16:47:57 1994 for user gerard
REMARK Plot file rmsdbb_rmsdbn.plt
REMARK Scatter plot of RMSDBB = RMS delta-B bonded atoms (A**2)
REMARK And RMSDBN = RMS delta-B of NCSCAI CAs improved LSQ (
REMARK Number of selected proteins = 324
REMARK Correlation coefficient = 0.4873894
REMARK Minimum of RMSDBB = 9.0000004E-02 ... Maximum = 32.19000
REMARK Average of RMSDBB = 4.244442 ... Standard deviation = 4.085301
REMARK Minimum of RMSDBN = 0.0000000E+00 ... Maximum = 35.60000
REMARK Average of RMSDBN = 8.212036 ... Standard deviation = 5.118954
XLABEL RMSDBB
YLABEL RMSDBN
COLOUR 4
NPOINT 324
XYVIEW -0.8729999 33.15300 -1.068000 36.66800
XVALUE *
5.6600E+00 8.7800E+00 9.8700E+00 2.6700E+00 1.5700E+00 5.6800E+00
3.3000E+00 3.4900E+00 3.1700E+00 1.0000E+00 2.1000E+00 5.0000E-01
...
1.2800E+00 3.0600E+00 2.4100E+00 2.4800E+00 1.3400E+00 3.9000E+00
YVALUE *
4.0000E+00 1.0900E+01 1.2100E+01 6.5000E+00 2.9000E+00 6.6000E+00
...
6.0000E+00 1.3900E+01 1.1000E+01 3.2000E+00 3.3000E+00 3.6000E+00
END
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
The ALl_corr command calculates the correlation coefficient between one property and all the others (no plot files are produced). If the absolute value of the correlation coefficient exceeds the value of "cutoff" (default 0.0), a message is printed:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > al rmsdbb 0.4
Property to correlate : (RMSDBB)

Property : (BMODE)
Nr of proteins : ( 434)
Corr. coeff. : ( 1.000)

Property : (BSDV)
Nr of proteins : ( 434)
Corr. coeff. : ( 0.550)

Property : (CORRBB)
Nr of proteins : ( 434)
Corr. coeff. : ( -0.836)

Property : (RMSDBN)
Nr of proteins : ( 324)
Corr. coeff. : ( 0.487)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
If you want to look at the distribution of values for a property, use the HIstogram command. Provide the name of the property and, optionally:
- either MINUS the number of bins, or the size of the bins
- the minimum value to consider
- the maximum value to use
- the name of the histogram plot file
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > hi resol 0.1 1.4 3.6
Nr of selected proteins : ( 476)
 Nr Name   T Default Minimum Maximum Description
 14 RESOL  R   -1.00    0.10    5.00 Nominal resolution of the data (A)
Average value : ( 2.403E+00)
St. deviation : ( 4.141E-01)
Minimum value : ( 1.500E+00)
Maximum value : ( 3.500E+00)
Sum of values : ( 1.144E+03)
Nr of bins : ( 22)
Bin width : ( 0.100)
Bin  2 [ 1.5000E+00, 1.6000E+00] Nr =   3 (  0.63 %) Cumul   3
Bin  3 [ 1.6000E+00, 1.7000E+00] Nr =   9 (  1.89 %) Cumul  12
Bin  4 [ 1.7000E+00, 1.8000E+00] Nr =  42 (  8.82 %) Cumul  54
Bin  5 [ 1.8000E+00, 1.9000E+00] Nr =   2 (  0.42 %) Cumul  56
Bin  6 [ 1.9000E+00, 2.0000E+00] Nr =  35 (  7.35 %) Cumul  91
Bin  7 [ 2.0000E+00, 2.1000E+00] Nr =  55 ( 11.55 %) Cumul 146
Bin  9 [ 2.2000E+00, 2.3000E+00] Nr =  30 (  6.30 %) Cumul 176
Bin 10 [ 2.3000E+00, 2.4000E+00] Nr =  27 (  5.67 %) Cumul 203
Bin 11 [ 2.4000E+00, 2.5000E+00] Nr =  17 (  3.57 %) Cumul 220
Bin 12 [ 2.5000E+00, 2.6000E+00] Nr = 103 ( 21.64 %) Cumul 323
Bin 13 [ 2.6000E+00, 2.7000E+00] Nr =   2 (  0.42 %) Cumul 325
Bin 14 [ 2.7000E+00, 2.8000E+00] Nr =  28 (  5.88 %) Cumul 353
Bin 15 [ 2.8000E+00, 2.9000E+00] Nr =  62 ( 13.03 %) Cumul 415
Bin 16 [ 2.9000E+00, 3.0000E+00] Nr =  18 (  3.78 %) Cumul 433
Bin 17 [ 3.0000E+00, 3.1000E+00] Nr =  33 (  6.93 %) Cumul 466
Bin 19 [ 3.2000E+00, 3.3000E+00] Nr =   8 (  1.68 %) Cumul 474
Bin 21 [ 3.4000E+00, 3.5000E+00] Nr =   2 (  0.42 %) Cumul 476
Plot file : (resol_histo.plt)
Plot file written
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
The plot file should be plotted/converted with the HI or PI command in O2D (or OMAC/o2dps). It may look as follows:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
REMARK Created by QDB V. 941020/0.3 at Thu Oct 20 16:57:09 1994 for user gerard
REMARK Plot file resol_histo.plt
REMARK Histogram plot of RESOL = Nominal resolution of the data (A)
REMARK Number of selected proteins = 476
REMARK Minimum of RESOL = 1.500000 ... Maximum = 3.500000
REMARK Average of RESOL = 2.403317 ... Standard deviation = 0.4140590
XLABEL RESOL
YLABEL Nr of proteins in bin
COLOUR 4
NPOINT 23
XYVIEW 1.334000 3.666000 -3.090000 106.0900
XVALUE *
1.4000E+00 1.5000E+00 1.6000E+00 1.7000E+00 1.8000E+00 1.9000E+00
2.0000E+00 2.1000E+00 2.2000E+00 2.3000E+00 2.4000E+00 2.5000E+00
2.6000E+00 2.7000E+00 2.8000E+00 2.9000E+00 3.0000E+00 3.1000E+00
3.2000E+00 3.3000E+00 3.4000E+00 3.5000E+00 3.6000E+00
YVALUE *
0.0000E+00 3.0000E+00 9.0000E+00 4.2000E+01 2.0000E+00 3.5000E+01
5.5000E+01 0.0000E+00 3.0000E+01 2.7000E+01 1.7000E+01 1.0300E+02
2.0000E+00 2.8000E+01 6.2000E+01 1.8000E+01 3.3000E+01 0.0000E+00
8.0000E+00 0.0000E+00 2.0000E+00 0.0000E+00 0.0000E+00
END
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
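The bin specification of the HIstogram command (a negative number gives the number of bins, a positive number the bin size) can be sketched in Python as below. The function is hypothetical and not taken from QDB. In particular, how a value that falls exactly on a bin boundary is assigned is an assumption here; in practice it depends on floating-point rounding.

```python
def histogram(values, bins, vmin, vmax):
    """Count values per bin.

    bins < 0 : use -bins bins between vmin and vmax
    bins > 0 : use bins as the bin width
    Returns (bin_width, list_of_counts).
    """
    if bins == 0 or vmax <= vmin:
        raise ValueError("need a non-zero bin specification and vmin < vmax")
    if bins < 0:
        nbins = int(-bins)
        width = (vmax - vmin) / nbins
    else:
        width = float(bins)
        nbins = max(1, int(round((vmax - vmin) / width)))
    counts = [0] * nbins
    for v in values:
        if v < vmin or v > vmax:
            continue  # outside the requested window
        # Assumption: boundary values go to the upper bin, except that
        # anything at the top edge is kept in the last bin.
        index = min(int((v - vmin) / width), nbins - 1)
        counts[index] += 1
    return width, counts
```

Called with a bin size of 0.1 between 1.4 and 3.6, as in the example above, the sketch sets up 22 bins; calling it with -22 instead of 0.1 gives the same layout.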
If you want to find out who solved the best and worst structures according to a single criterion, use the SOrt command. Provide the name of the property to sort on and, optionally:
- the number of proteins to list (0 means all selected proteins, a positive number means the top N entries, and a negative number means both the top and bottom N entries)
- the names of up to three additional numeric properties which should also be listed
For example, to sort by overall G-factor, use:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > so pogfac -10 resol rmsimp year
Sorted by : (POGFAC)
Also list : (RESOL)
Also list : (RMSIMP)
Also list : (YEAR)
List : ( -10)
Nr of selected proteins : ( 468)
 Nr Indx ID   POGFAC  RESOL RMSIMP YEAR Authors
  1   24 4HHB -7.700  1.740  0.364   84 G.FERMI,M.F.PERUTZ
  2   21 4DFR -3.400  1.700  0.484   82 D.J.FILMAN,D.A.MATTHEWS,J.T.BOLIN,J.KRAU
  3  469 1HCY -2.500  3.200 -1.000   91 A.VOLBEDA,W.G.J.HOL
  4  125 2SOD -2.400  2.000  0.882   80 J.A.TAINER,E.D.GETZOFF,J.S.RICHARDSON,D.
  5  468 1HC1 -2.100  3.200 -1.000   91 A.VOLBEDA,W.G.J.HOL
  6  302 1RBA -2.000  2.600  0.925   91 G.SCHNEIDER,E.SODERLIND
  7  403 3HVP -2.000  2.800 -1.000   89 A.WLODAWER,M.JASKOLSKI,M.MILLER
  8  164 1FXI -1.900  2.200  0.539   90 T.TSUKIHARA
  9  374 1CID -1.800  2.800 -1.000   93 R.L.BRADY,E.J.DODSON,G.LANGE
 10  391 2AAT -1.600  2.800 -1.000   89 D.SMITH,S.ALMO,M.TONEY,D.RINGE
==============
459   39 1DXU  0.200  1.800  0.253   92 J.S.KAVANAUGH,A.ARNONE
460   34 3MDS  0.200  1.800  0.159   93 M.L.LUDWIG,A.L.METZGER,K.A.PATTRIDGE,W.C
461   40 1DXV  0.200  1.800  0.271   92 J.S.KAVANAUGH,A.ARNONE
462    5 4SDH  0.200  1.600  0.246   93 W.E.ROYERJUNIOR
463  475 1HNB  0.200  3.500  0.781   93 S.RAGHUNATHAN,R.J.CHANDROSS,R.H.KRETSING
464  439 1HNC  0.300  3.000  0.923   93 S.RAGHUNATHAN,R.J.CHANDROSS,R.H.KRETSING
465  114 1FIA  0.300  2.000  0.333   91 D.KOSTREWA,J.GRANZIN,H.-W.CHOE,J.LABAHN,
466   20 2MSB  0.300  1.700  0.265   92 W.I.WEIS,K.DRICKAMER,W.A.HENDRICKSON
467   97 1DSB  0.300  2.000  0.633   93 J.L.MARTIN,J.C.A.BARDWELL,J.KURIYAN
468  110 2NCK  0.400  2.000  0.253   93 R.L.WILLIAMS,D.A.OREN,E.ARNOLD
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
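The meaning of the "number of proteins to list" parameter can be sketched in Python as follows. The function name and the record layout (a list of dictionaries) are hypothetical. Sorting in ascending order is an assumption based on the example output above, and the sketch expects to be given only entries for which the property is observed.

```python
def sort_listing(records, prop, nlist=0):
    """Return the records that would be listed, sorted on prop.

    nlist == 0 : all records
    nlist  > 0 : the first nlist records of the sorted list
    nlist  < 0 : the first and the last |nlist| records
    """
    # Assumption: ascending order, as in the example output.
    ranked = sorted(records, key=lambda rec: rec[prop])
    n = abs(nlist)
    if nlist == 0 or n >= len(ranked):
        return ranked
    if nlist > 0:
        return ranked[:n]
    if 2 * n >= len(ranked):
        return ranked  # head and tail would overlap: list everything
    return ranked[:n] + ranked[-n:]
```

With a negative value, such as the -10 used above, both ends of the sorted list are shown, which makes the extremes on either side easy to spot.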
If you want statistics for only a subset of the proteins (e.g., all structures with NCS, solved after 1988 at resolutions worse than 2.5 A), use the SElect command:
- ALl selects all proteins
- NOne selects none of the proteins
- INvert selects the complement of the previously selected proteins
- ANd VAlid only keeps those for which a certain property is defined
- OR VAlid adds those for which a certain property is defined
- ANd IF only keeps those for which the value of a defined property
satisfies an expression (<, =, or > than a cut-off value)
- OR IF adds those for which the value of a defined property
satisfies an expression (<, =, or > than a cut-off value)
- SAve stores the current selection (as a so-called saveset)
- REstore restores a previously stored saveset
- ? prints a listing of the savesets
From version 1.0 onward, you can also select by some of the text attributes, namely:
- AUthor
- COmpound
- JOurnal
Each of these three types of selection works as an AND on the current set of selected proteins.
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > sel all
All proteins selected
QDB [476/476] > sel and valid numncs
Nr of selected proteins : ( 345)
QDB [345/476] > sel and if year > 88
Nr of selected proteins : ( 331)
QDB [331/476] > sel and if resol > 2.5
Nr of selected proteins : ( 90)
QDB [ 90/476] > se sav 1 "ncs after 1988 worse than 2.5 A resol"
Saveset  1 [ 90/ 476 ] = ncs after 1988 worse than 2.5 A resol
QDB [ 90/476] > se ?
Saveset  1 [ 90/ 476 ] = ncs after 1988 worse than 2.5 A resol
Saveset  2 [  0/ 476 ] = NO proteins selected
Saveset  3 [  0/ 476 ] = NO proteins selected
Saveset  4 [  0/ 476 ] = NO proteins selected
Saveset  5 [  0/ 476 ] = NO proteins selected
Saveset  6 [  0/ 476 ] = NO proteins selected
Saveset  7 [  0/ 476 ] = NO proteins selected
Saveset  8 [  0/ 476 ] = NO proteins selected
Saveset  9 [  0/ 476 ] = NO proteins selected
Saveset 10 [  0/ 476 ] = NO proteins selected
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > se jo nature
Nr of selected proteins : ( 20)
QDB [ 20/476] > se all
All proteins selected
QDB [476/476] > se au w.g.j.hol
Nr of selected proteins : ( 13)
QDB [ 13/476] > se all
All proteins selected
QDB [476/476] > se comp dismutase
Nr of selected proteins : ( 10)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
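The logic behind the numeric selection commands amounts to simple set operations, which can be sketched in Python as below. All names are hypothetical. As a simplification, an unobserved value is represented by None here, whereas QDB itself treats a value outside the valid range as unobserved.

```python
import operator

# The three comparison operators accepted by ANd IF / OR IF.
OPERATORS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def valid(db, prop):
    """IDs for which prop is observed (None marks 'unobserved' in this sketch)."""
    return {pid for pid, props in db.items() if props.get(prop) is not None}


def matching(db, prop, op, cutoff):
    """IDs for which prop is observed and satisfies 'prop op cutoff'."""
    test = OPERATORS[op]
    return {pid for pid in valid(db, prop) if test(db[pid][prop], cutoff)}


def select_and(selection, candidates):
    """ANd: keep only the selected proteins that are also candidates."""
    return selection & candidates


def select_or(selection, candidates):
    """OR: add the candidates to the current selection."""
    return selection | candidates


def select_invert(db, selection):
    """INvert: the complement of the current selection."""
    return set(db) - selection
```

In these terms, "sel and if year > 88" corresponds to select_and(selection, matching(db, "YEAR", ">", 88)), and a saveset is just a stored copy of the selection.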
From version 1.0 onward, an overall score is calculated for each
protein. This is done by taking a number of quality indicators and
classifying each of them as BAD, POOR, FAIR, OKAY or GOOD. Each of these
indicators has a weight W. All proteins start with a score of zero.
If a protein scores BAD for a criterion, -2 * W is added to its score;
if it's POOR, -W is added; FAIR leaves the score unchanged; OKAY gives
+W and GOOD +2 * W.
The scoring formula is highly subjective of course, but I tend to find
that it agrees well with my own impression of the quality and
reliability of protein models.
The computed value is stored in a property called "SCORE" and can be
used for selecting, sorting, etc.
The following properties are used in the formula:
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
Quality indicators used in scoring:
           BAD | POOR | FAIR | OKAY | GOOD
Name        -2 |  -1  |   0  |  +1  |  +2  Weight
RESOL         2.20   2.00   1.50   1.20      5.00
RFAC          0.25   0.20   0.15   0.10      1.00
RHO           0.50   1.00   1.50   3.00      1.00
BWAVE        60.00  50.00  40.00  30.00      2.00
BAVE         40.00  30.00  20.00  10.00      2.00
RMSDBB       10.00   7.50   5.00   2.50      5.00
RMSIMP        0.50   0.30   0.20   0.10      3.00
RMSDBN       10.00   7.50   5.00   2.50      3.00
PRMFRP       70.00  80.00  85.00  90.00      3.00
PRDARP        2.00   1.50   1.00   0.50      5.00
POGFAC       -2.00  -1.00  -0.50   0.00      3.00
DACA         -2.00  -1.00  -0.50   0.00      5.00
FLIP          4.00   3.00   2.00   1.00      3.00
BADRSC       20.00  15.00  10.00   5.00      3.00
%PHI10       15.00  12.00   8.00   5.00      1.00
%PSI10       15.00  12.00   8.00   5.00      1.00
%DIH10       15.00  12.00   8.00   5.00      1.00
%ANG5        15.00  12.00   8.00   5.00      1.00
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
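One possible reading of this table can be sketched in Python as below. This is a guess at the logic, not QDB's actual code: the four numbers per indicator are taken to be the boundaries between the five classes, their order (descending or ascending) is taken to indicate whether low or high values are better, the boundaries are treated as strict, and unobserved indicators (None in this sketch) are assumed to contribute nothing. The function names are hypothetical.

```python
def classify(value, bounds):
    """Return -2 (BAD) ... +2 (GOOD) given the four class boundaries.

    bounds are listed from the BAD side to the GOOD side, as in the table.
    """
    # Assumption: descending boundaries mean that lower values are better.
    lower_is_better = bounds[0] > bounds[-1]
    level = -2
    for bound in bounds:
        # Assumption: a value exactly on a boundary stays in the worse class.
        better = value < bound if lower_is_better else value > bound
        if better:
            level += 1
    return level


def score(values, indicators):
    """Sum weight * class over all indicators with an observed value."""
    total = 0.0
    for name, (bounds, weight) in indicators.items():
        value = values.get(name)
        if value is None:
            continue  # assumption: unobserved indicators add nothing
        total += weight * classify(value, bounds)
    return total
```

Under this reading, a resolution of 1.8 A lies between the boundaries 2.00 and 1.50 and is therefore FAIR (no contribution), whereas a PRMFRP value of 92 lies beyond 90.00 and is GOOD, adding 2 * 3.00 to the score.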
From version 1.2 onward, the significance of the correlations found with COrr and ALl_corr is calculated. See chapter 13.7 of "Numerical Recipes". The standard deviation of the correlation coefficient, r, is roughly 1/SQRT(N), and the significance of a correlation is erfc(|r| * SQRT (N/2)), with erfc the complementary error function. A *small* value for this number indicates that the two distributions are significantly correlated. (Given the large values for N, this is almost always the case here ;-)
However, note that the assumption of rapidly dying tails for the individual distributions is not true (they are more Poisson-like), so that the significance of the significance has to be taken with a large rock of salt. In fact, it is probably impossible to decide if a correlation is significant in the cases considered here (I couldn't find any appropriate tests, nor could the assembled statisticians on Usenet (sci.stats.help, I think it was)).
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
QDB [476/476] > al resol
Property to correlate : (RESOL)
...
Property : (PNBC)
Nr of proteins : ( 468)
Corr. coeff. : ( 0.004)
Significance : ( 3.440E+00)
...
Property : (PRMFRP)
Nr of proteins : ( 468)
Corr. coeff. : ( -0.480)
Significance : ( 2.793E-25)
...
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
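The significance formula quoted above, erfc(|r| * SQRT(N/2)), is easy to evaluate outside QDB. A minimal Python sketch follows; the function names are hypothetical, and the result need not agree with QDB's printed digits, since the program works with the unrounded correlation coefficient.

```python
import math


def significance(r, n):
    """erfc(|r| * sqrt(n/2)); small values indicate a significant correlation."""
    return math.erfc(abs(r) * math.sqrt(n / 2.0))


def approx_sigma(n):
    """Rough standard deviation of the correlation coefficient, 1/sqrt(n)."""
    return 1.0 / math.sqrt(n)
```

For r = -0.480 and N = 468 the sketch gives a value of the order of 1E-25, in line with the PRMFRP entry in the example above.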
None, at present.