Uppsala Software Factory - DEJAVU Manual

1 DEJAVU - GENERAL INFORMATION
2 REFERENCES
3 VERSION HISTORY
4 INTRODUCTION
5 QUICK START GUIDE

5.1 with (CA) coordinates

5.2 without coordinates
6 USER INPUT FILES

6.1 description

6.2 keywords

6.3 example
7 DATABASE
8 RUNNING THE PROGRAM

8.1 startup

8.2 options

8.3 LIst

8.4 EXtract

8.5 REad
9 FINDING A MOTIF

9.1 input

9.2 SSEs

9.3 search criteria

9.4 search constraints and O macro

9.5 output

9.6 algorithm

9.7 more hits
10 DEJANA
11 ANALYSING THE RESULTS

11.1 O macro

11.2 running O

11.3 analysis on the display
12 A REALISTIC EXAMPLE

12.1 SSE file

12.2 search parameters

12.3 output

12.4 O macro

12.5 running O
13 AUTOMATIC CREATION OF INPUT FILES

13.1 PRO1

13.2 PRO2

13.3 SSE file

13.4 makedb
14 DETAILED ANALYSIS OF RESULTS ON CRO

14.1 results
15 MISCELLANEOUS

15.1 HOW TO CREATE AND USE YOUR OWN DATABASE

15.2 HOW TO SELECT SEARCH PARAMETERS

15.3 OTHER HINTS

15.4 PROBLEMS
16 SELECT OPTION
17 INCREMENTAL SEARCH EXAMPLE
18 TOPOLOGY OPTION
19 INSTALLING THE SOFTWARE
20 SYMBOLIC MATCHING
21 RELEASE NOTES
22 KNOWN BUGS

1 DEJAVU - GENERAL INFORMATION

Program : DEJAVU
Version : 981127
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 590, SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : detecting similarities/motifs in protein structures using a large database
Package : DEJAVU

2 REFERENCES

Reference(s) for this program:

* 1 * G.J. Kleywegt & T.A. Jones (1994). Halloween ... Masks and Bones. In "From First Map to Final Model", edited by S. Bailey, R. Hubbard and D. Waller. SERC Daresbury Laboratory, Warrington, pp. 59-66.

* 2 * G.J. Kleywegt & T.A. Jones (1997). Taking the fun out of map interpretation. CCP4/ESF-EACBM Newsletter on Protein Crystallography 33, January 1997, pp. 19-21. [http://alpha2.bmc.uu.se/usf/factory_7.html]

* 3 * G.J. Kleywegt & T.A. Jones (1997). Detecting folding motifs and similarities in protein structures. Methods in Enzymology 277, 525-545.

* 4 * G.J. Kleywegt & T.A. Jones (1999 ?). Chapter 25.2.6. O and associated programs. Int. Tables for Crystallography, Volume F. To be published.

3 VERSION HISTORY

921022 - 0.1 - Started programming; called program "AnalSecS" for "ANALyse SECondary Structure" ...
921029 - 1.0 - First working version released in-house; first version of the manual
921030 - 1.1 - Minor changes; continued manual; cro analysis
921031 - 1.2 - Minor changes to lsq-macro and output; corrected non-conservation of directionality; introduced weights in the score calculation
921103 - 1.3 - Changed LIst option; added STatistics option
930105 - 1.4 - Changed name to DEJAVU (at last); updated manual
930125 - 1.5 - Implemented distance options I and A; implemented incremental search for maximum common motif; option to try to avoid multiple chain hits
930126 - 1.6 - Removed some minor bugs
930222 - 1.7 - new SELECT option; avoid hits with multiple copies of the "same" protein
930302 - 1.8 - TOPOLOGY option (crummy !!!)
930713 - 2.0 - cleaned up for export; added notes on installing and running the software to this manual file
930826 - 2.1 - more info when errors occur during database read; increased array dimensions for new databases
930921 - 2.1.1 - minor bug fix in SElect (needed for DEC Alphas)
930923 - - added jiffy program POST to analyse O log file
930924 - 3.0 - altered SElect command to continue cycling until you actually choose option 0 (=back to main menu); BONES search option (part of INcr); works for P2 !
930927 - 3.1 - if BONES search, check that there are > 2 SSEs; if NO directionality, use |cos| for the score; option to skip all proteins whose PDB file does not exist (actually: can not be read by the user); only include factors in score whose weight > 0.01; include centroid-LSQ-RMSD as a factor contributing to the score; new option to do either an lsq_explicit inside O, or an lsq_centroid inside DEJAVU; make lsq_improve with both complete molecules the default for the FInd option as well
931206 - 4.0 - interface with LSQMAN (through input file)
941101 - 4.1 - increased dimensioning to 2500 structures
950118 - 4.2 - sensitive to environment variable GKLIB
950718 - 4.3 - replaced "mismatch nr of residues" by two separate cut-offs for "too short" and "too long" SSEs
970102 - 5.0 - better suggested defaults for BONES searches; sort the hits (by nr of SSEs -> RMSD -> Score); reduced the amount of output generated by the program; add PDB identifier to PRINT statements in O macros to facilitate grep-ing results for a particular entry (e.g.: "grep ^print lsq.omac | grep 1ack")
970115 - - added DEJANA to sort O macros produced by DEJAVU or LSQMAN; added quick starter guide to manual and a brief description of DEJANA
970131 - 5.1 - moved a few search parameters which are rarely used to a separate PArameter command
970729 - 5.2 - LSQMAN will now also write the aligned hits to PDB files (can be switched off) - this is useful for non-O users
981020 - 5.2.1 - minor bug fix (RMSD not always printed in list of hits)
981127 - 5.3 - new SElect options to (de)select multiple entries; list total number of mismatched residues for every hit; list total number of gap-length differences (between neighbouring SSEs) for every hit; implemented symbolic searching where spatial arrangements of SSEs are not used, only their type and length (in terms of residues) - can be used if you get no hits at all, or if you have a very reliable secondary structure prediction

4 INTRODUCTION

In the "good old days" protein scientists made it a sport to become walking databanks of secondary structure motifs; upon seeing a particular fold, for example during a seminar, they would say: "Oh, but that fold also occurs in XXX", and, boy, did you feel stupid for having failed to notice this. Well, your worries might be coming to an end soon, thanks to DEJAVU.

DEJAVU will take a description of the secondary structure elements that occur in your particular protein and compare it to a huge database of secondary structure elements that occur in protein structures that have been published as PDB files.

What's the basic idea ? A MOTIF of secondary structure elements (henceforth abbreviated "SSEs") consists of N SSEs, each of which comprises M(i) residues and has a length of L(i) Angstrom (measured from the first residue's Calpha to that of the last residue), and which is characterised by a matrix D(i,j) which contains the centre-to-centre distances (for example) and by another matrix C(i,j) which contains the cosines of the angles made by the direction vectors of the individual elements (the direction vector goes FROM the N-terminal Calpha TO the C-terminal one). Finding a motif in the database that is SIMILAR to that which occurs in your protein then comes down to finding suitable collections of N SSEs in the structures of other proteins which have approximately the same numbers of residues, the same lengths and comparable mutual distances and direction-vector cosines.
And that is ALL there is to it !
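
The quantities named above can be sketched in a few lines of Python. This is only an illustration of the definitions of L(i), D(i,j) and C(i,j), not the code DEJAVU itself uses; the function names are made up for this sketch, and each SSE is assumed to be given as a pair of Calpha coordinates (first residue, last residue).

```python
import math


def sse_geometry(first_ca, last_ca):
    """Return (length, centre, unit direction vector) for one SSE.

    first_ca and last_ca are the (x, y, z) coordinates, in Angstrom, of
    the Calpha atoms of the first (N-terminal) and last (C-terminal)
    residue. The direction vector points FROM the first TO the last.
    """
    delta = [b - a for a, b in zip(first_ca, last_ca)]
    length = math.sqrt(sum(d * d for d in delta))
    centre = [(a + b) / 2.0 for a, b in zip(first_ca, last_ca)]
    direction = [d / length for d in delta]
    return length, centre, direction


def motif_matrices(sses):
    """Build D(i,j) (centre-to-centre distances) and C(i,j) (cosines).

    sses is a list of (first_ca, last_ca) pairs.
    """
    geometry = [sse_geometry(first, last) for first, last in sses]
    n = len(geometry)
    dist = [[0.0] * n for _ in range(n)]
    cosine = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            _, centre_i, dir_i = geometry[i]
            _, centre_j, dir_j = geometry[j]
            dist[i][j] = math.dist(centre_i, centre_j)
            # Dot product of two unit vectors = cosine of their angle.
            cosine[i][j] = sum(a * b for a, b in zip(dir_i, dir_j))
    return dist, cosine


# Two anti-parallel elements, 5 A apart (made-up coordinates).
example = [((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
           ((10.0, 5.0, 0.0), (0.0, 5.0, 0.0))]
D, C = motif_matrices(example)
```

Comparing two motifs then amounts to comparing the corresponding entries of these matrices (and the residue counts and lengths) within user-defined tolerances.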

NOTE: unless you have compelling reasons to do otherwise, you are strongly advised to use the INcremental search option rather than the FInd option, since the former is much less sensitive to small differences between similar structures.

NOTE: you can also use this program with "SSEs" based on a skeleton (Bones). Simply create an SSE file with dummy residue names, find the terminal CA positions by clicking on the appropriate Bones atoms & guess the number of residues as:
- N->C distance (A) divided by 1.6 A/residue for a helix
- N->C distance (A) divided by 3.4 A/residue for a strand
For more details, see: G.J. Kleywegt & T.A. Jones, "Halloween ... Masks and Bones", in "From First Map to Final Model" (S. Bailey, R. Hubbard & D. Waller, Eds.), SERC Daresbury Laboratory, Warrington (1994), pp. 59-66.
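
The rule of thumb for Bones-based SSEs can be written down as a small helper. This is a sketch only: the per-residue values are those quoted in the note above, while the function name and the choice to round to the nearest integer (with a minimum of one residue) are assumptions made for this illustration.

```python
import math

# Angstrom per residue, as quoted in the note above.
RISE_PER_RESIDUE = {"ALPHA": 1.6, "BETA": 3.4}


def guess_nr_residues(sse_type, first_ca, last_ca):
    """Guess the number of residues in a Bones-derived SSE.

    first_ca / last_ca are the (x, y, z) positions picked for the
    N-terminal and C-terminal ends of the element.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    distance = math.dist(first_ca, last_ca)
    return max(1, round(distance / RISE_PER_RESIDUE[sse_type]))


# A 16 A long helix and a 34 A long strand (made-up coordinates).
helix_guess = guess_nr_residues("ALPHA", (0.0, 0.0, 0.0), (16.0, 0.0, 0.0))
strand_guess = guess_nr_residues("BETA", (0.0, 0.0, 0.0), (0.0, 34.0, 0.0))
```

Since the result is only a guess, the "too short" and "too long" tolerances in a Bones search should be chosen generously enough to absorb the error.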

NOTE: This program is sensitive to the environment variable GKLIB. If set, the name of this directory will be prepended to the default name for the library file needed by this program. For example, in Uppsala, put the following line in your .login or .cshrc file: setenv GKLIB /nfs/public/lib
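
As an illustration of what "prepended" means here, the following sketch builds the default library name from GKLIB. It is not taken from the DEJAVU source; the file name "dejavu.lib" is borrowed from the prompt shown in the quick start guide, and the function name is hypothetical.

```python
import os
import posixpath

# Assumed default file name, taken from the startup prompt in the
# quick start guide ("/nfs/public/lib/dejavu.lib").
DEFAULT_LIBRARY = "dejavu.lib"


def default_library_path(environ=None):
    """Return the default library file name, honouring GKLIB if set."""
    environ = os.environ if environ is None else environ
    gklib = environ.get("GKLIB", "")
    if not gklib:
        return DEFAULT_LIBRARY
    return posixpath.join(gklib, DEFAULT_LIBRARY)


with_gklib = default_library_path({"GKLIB": "/nfs/public/lib"})
without_gklib = default_library_path({})
```

With GKLIB set to /nfs/public/lib this yields the default that the program offers in the quick start transcript; without it, only the bare file name remains.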

5 QUICK START GUIDE

This section briefly goes through the necessary steps of running DEJAVU - it is NOT a substitute for reading the manual.

5.1 with (CA) coordinates

* set up the programs and database as described elsewhere in this document

* run the "make_sse" script to generate an SSE file (the latest version of this script can be found in the OMAC directory)

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 264 gerard sarek 17:07:00 gerard/junk > make_sse
 make_sse - generate an SSE file for DEJAVU - gerard kleywegt
 NOTE: this script will ONLY work if you have the run alias
 set up correctly .... if you do not know what this is,
 ask Gerard .....................
 Enter a 4 character PDB identifier for your structure > crab
 crab
 Enter the COMPLETE path and file name of your PDB file > /nfs/pdb/full/1cbs.pdb
 /nfs/pdb/full/1cbs.pdb
 Enter a comment string about your structure > crabp 2
 crabp 2
 Enter the name of an O database file that you own > gen.o6
 gen.o6
 ... running PRO1 ...
 [...]
 Removing temporary files ...
 SSE file crab.sse created !
!
! === crab
!
MOL   crab
NOTE  crabp 2
PDB   /nfs/pdb/full/1cbs.pdb
!
BETA  'B1' '6' '8' 3 28.796 22.676 37.211 30.802 19.424 31.450
BETA  'B2' '11' '13' 3 29.015 12.742 24.851 24.586 12.258 19.768
ALPHA 'A1' '15' '22' 8 22.385 17.198 17.681 13.432 21.938 13.569
ALPHA 'A2' '26' '36' 11 22.755 24.444 6.742 27.253 23.416 21.312
BETA  'B3' '39' '46' 8 28.851 24.201 26.747 16.441 23.718 46.062
BETA  'B4' '49' '56' 8 12.870 24.551 42.348 27.978 28.831 24.684
BETA  'B5' '59' '65' 7 23.783 31.655 22.882 11.058 26.641 37.132
BETA  'B6' '70' '72' 3 3.912 26.102 34.059 7.826 30.063 29.644
BETA  'B7' '82' '86' 5 4.988 27.197 27.492 6.946 16.621 35.945
BETA  'B8' '92' '99' 8 14.016 11.835 35.981 4.357 26.721 21.041
BETA  'B9' '106' '113' 8 6.255 18.453 18.868 20.634 11.503 37.091
BETA  'B10' '118' '125' 8 26.242 11.413 35.458 11.634 16.962 17.923
BETA  'B11' '128' '135' 8 14.286 12.486 16.407 29.697 14.762 34.394
ENDMOL

 Finished with exit status 0
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* start DEJAVU

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 265 gerard sarek 17:07:00 gerard/junk > run dejavu
   
 [...]
   
 DEJAVU SSE library file ? (/nfs/public/lib/dejavu.lib)
   
 List contents of SSE library (Y/N) ? (N)
   
 Skip non-existent PDB files  (Y/N) ? (N)
   
 [...]
   
 ===> Option ? (READ)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* read your new SSE file

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ) read
 User DEJAVU file ? (user.sse) crab.sse
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* start an INcremental search; tweak the input parameters until you get more hits than you would hope to find (we'll get rid of the poor ones later; better to find a few poor hits now, than to miss correct ones)

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ) in
   
 ********** NEW QUERY **********
   
 Elements : ( B1 B2 A1 A2 B3 B4 B5 B6 B7 B8 B9 B10 B11)
 Nr of SSEs : (      13)
 Min nr of residues for SSEs             ? (       4)
 Nr of SSEs : (      10)
 Remaining SSEs : ( A1 A2 B3 B4 B5 B7 B8 B9 B10 B11)
 Min nr of elements to match (0 = abort) ? (       4) 6
   
 Is this a BONES search ? (N)
   
 Do lsq_explicit inside O ? (N)
   
 Define how much the nr of residues in SSEs may differ
 by defining how many residues shorter or longer SSEs in
 the database may be compared to those in your protein.
 Max nr of residues "too short" ? (          2)
 Max nr of residues "too long"  ? (          4)
   
 Mismatch element length        ? (  10.000)
 Mismatch distances             ? (   8.000)
 Mismatch cosines               ? (   0.400)
   
 Weights for nr res, length, dist, cos, rmsd
 Weights for scoring     ? (   0.001    0.001    0.100    0.100    0.500)
 Normalised weights      : (   0.014    0.014    0.139    0.139    0.694)
   
 Possible distance criteria:
  C  => centre-to-centre
  H  => MIN head-tail and tail-head (anti-parallel)
  T  => MIN head-head and tail-tail (parallel)
  I  => MIN of all these distances
  A  => MAX of all these distances
 Which distances (C/H/T/I/A) ? (C)
   
 Extensive output        ? (N)
   
 Conserve directionality ? (Y)
   
 Conserve absolute motif ? (Y)
   
 Conserve neighbours     ? (N)
   
 Attempt to avoid multi-chain hits ? (N)
 Attempt to avoid identical proteins ? (N)
   
 Create O macro file      ? (Y)
 O macro file             ? (lsq.omac)
 Create LSQMAN input file ? (Y)
 LSQMAN input file        ? (lsqman.inp)
   
 [...]
   
 Sorting hits ...
   
   Nr Entry  PDB  SSE  RMSD SCORE Compound
 ==== ===== ==== ==== ===== ===== ========
    1   152 1cbs   10  0.00  0.00 cellular retinoic-acid-binding protein type ii co - human (homo sapie
    2   149 1cbi   10  1.73  1.50 mol_id: 1; - mol_id: 1;
    3   490 1hmt    9  1.31  1.15 fatty acid binding protein (human muscle, m-fabp) - organism: homo sa
    4   619 1lid    9  1.45  1.27 adipocyte lipid-binding protein complexed with ol - mouse (mus muscul
    5   759 1opb    9  1.94  1.66 cellular retinol binding protein ii (holo form) - rat (rattus rattus
    6   219 1crb    9  2.64  2.31 cellular retinol binding protein (crbp) complexed - rat (rattus rattu
    7   825 1pmp    8  1.13  1.03 p2 myelin protein (p2) - bovine (bos taurus
    8   380 1ftp    8  1.73  1.50 fatty-acid-binding protein - desert locust (sch
    9   663 1mdc    8  2.43  2.08 fatty acid binding protein (manduca sexta) (mfb2) - tobacco hornworm
   10   197 1cly    7  3.94  3.64 mol_id: 1; -
   11   715 1ncb    7  6.02  5.43 n9 neuraminidase-nc41 (e.c.3.2.1.18) mutant with - influenza virus a/
   
 [...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* when you're happy, quit the program

* it is strongly recommended to now run LSQMAN to separate the men from the boys

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 266 gerard sarek 17:07:00 gerard/junk > run lsqman < lsqman.inp > lsqman.out
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* now run DEJANA to sort out the hits you're really interested in, let it write them to a new O macro, and execute this macro from within O. The use of DEJANA is described elsewhere in this manual
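
A remark on the "Normalised weights" line in the search transcript above: the printed values are what one obtains if every weight is first raised to at least 0.01 and the weights are then divided by their sum. This is an inference from the printed numbers only, not a documented rule, so the floor value and the function name in the sketch below are assumptions.

```python
# ASSUMPTION: the 0.01 floor is inferred from the printed numbers in
# the transcripts; it is not stated anywhere in the manual.
WEIGHT_FLOOR = 0.01


def normalise_weights(weights, floor=WEIGHT_FLOOR):
    """Raise each weight to at least `floor`, then scale to sum 1."""
    floored = [max(w, floor) for w in weights]
    total = sum(floored)
    return [w / total for w in floored]


def as_printed(weights):
    """Format the way the transcript does (three decimals)."""
    return " ".join(f"{w:.3f}" for w in weights)


# Default weights for nr res, length, dist, cos, rmsd.
defaults = [0.001, 0.001, 0.100, 0.100, 0.500]
print(as_printed(normalise_weights(defaults)))
# prints: 0.014 0.014 0.139 0.139 0.694
```

The same procedure applied to the Bones weights 0 0 1 1 5 gives 0.001 0.001 0.142 0.142 0.712, which matches the Bones transcript in section 5.2.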

5.2 without coordinates

* set up the programs and database as described elsewhere in this document

* you will have to create an SSE file. Usually, this means you have at least a set of Bones in which you can identify SSEs. Perhaps you have used ESSENS and SOLEX to get an SSE file (see the SOLEX manual for more details), for example:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Created by SOLEX V. 961228/1.0 at Sat Dec 28 23:36:51 1996 for user gerard
!
MOL   bone
NOTE  auto-generated by SOLEX
PDB   btrace.pdb
!
BETA  'B1' ' 1' ' 12' 12 61.43 60.73 47.76 33.97 55.75 27.06
BETA  'B2' ' 13' ' 21' 9 44.24 63.08 16.44 37.40 64.56 41.58
BETA  'B3' ' 22' ' 29' 8 56.31 63.65 17.51 44.11 72.87 32.13
BETA  'B4' ' 30' ' 37' 8 49.36 51.47 27.01 61.21 66.47 37.90
BETA  'B5' ' 38' ' 45' 8 57.25 53.27 22.42 59.65 74.87 31.87
BETA  'B6' ' 46' ' 52' 7 45.76 52.50 31.42 59.24 63.58 40.97
BETA  'B7' ' 53' ' 59' 7 62.51 73.28 34.79 52.24 58.42 26.17
BETA  'B8' ' 60' ' 65' 6 47.19 65.18 19.62 39.41 67.92 33.35
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* start DEJAVU and read in your SSE file

* start an INcremental search, and answer Yes when asked whether this is a Bones search. Tweak the input parameters until you get more hits than you would ever want (we'll sort out the good and the bad later)

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ) in
   
 ********** NEW QUERY **********
   
 Elements : ( B1 B2 B3 B4 B5 B6 B7 B8)
 Nr of SSEs : (       8)
 Min nr of residues for SSEs             ? (       4)
 Nr of SSEs : (       8)
 Remaining SSEs : ( B1 B2 B3 B4 B5 B6 B7 B8)
 Min nr of elements to match (0 = abort) ? (       4) 6
   
 Is this a BONES search ? (N) yes
 BONES search mode
   
 BONES search; will do lsq_centroid
   
 Define how much the nr of residues in SSEs may differ
 by defining how many residues shorter or longer SSEs in
 the database may be compared to those in your protein.
 BONES suggested value: 1 or 2
 Max nr of residues "too short" ? (          2)
 BONES suggested value: 4 to 6
 Max nr of residues "too long"  ? (          4)
   
 BONES suggested value: ~10
 Mismatch element length        ? (  10.000)
 BONES suggested value: ~6
 Mismatch distances             ? (   8.000) 6
 BONES suggested value: 0.2 to 0.4
 Mismatch cosines               ? (   0.400) 0.2
   
 Weights for nr res, length, dist, cos, rmsd
 BONES suggested values: 0 0 1 1 5
 Weights for scoring     ? (   0.001    0.001    0.100    0.100    0.500) 0 0 1 1 5
 Normalised weights      : (   0.001    0.001    0.142    0.142    0.712)
   
 Possible distance criteria:
  C  => centre-to-centre
  H  => MIN head-tail and tail-head (anti-parallel)
  T  => MIN head-head and tail-tail (parallel)
  I  => MIN of all these distances
  A  => MAX of all these distances
 BONES suggested value: C !!!
 Which distances (C/H/T/I/A) ? (C)
   
 Extensive output        ? (N)
   
 BONES suggested value: NO !!!
 Conserve directionality ? (Y) no
   
 BONES suggested value: Y
 Conserve absolute motif ? (Y)
   
 BONES suggested value: NO !!!
 Conserve neighbours     ? (N) no
   
 Attempt to avoid multi-chain hits ? (N)
 Attempt to avoid identical proteins ? (N)
   
 Create O macro file      ? (Y)
 O macro file             ? (lsq.omac)
   
 [...]
   
 Nr of database entries : (       1381)
 Nr of selected entries : (       1381)
 Nr of matching entries : (         54)
 Nr of hits (total)     : (        376)
   
 Sorting hits ...
   
   Nr Entry  PDB  SSE  RMSD SCORE Compound
 ==== ===== ==== ==== ===== ===== ========
    1   380 1ftp    7  2.71  2.26 fatty-acid-binding protein - desert locust (sch
    2   825 1pmp    6  2.20  1.92 p2 myelin protein (p2) - bovine (bos taurus
    3   152 1cbs    6  2.53  2.05 cellular retinoic-acid-binding protein type ii co - human (homo sapie
    4   547 1igc    6  2.74  2.42 igg1 fab fragment complexed with protein g (domai - molecule: igg1 fa
    5   338 1fbi    6  2.86  2.52 fab fragment of the monoclonal antibody f9.13.7 ( - immunoglobulin f9
    6   619 1lid    6  2.88  2.39 adipocyte lipid-binding protein complexed with ol - mouse (mus muscul
    7   663 1mdc    6  2.93  2.57 fatty acid binding protein (manduca sexta) (mfb2) - tobacco hornworm
    8   490 1hmt    6  2.94  2.41 fatty acid binding protein (human muscle, m-fabp) - organism: homo sa
    9  1150 2cgr    6  3.01  2.61 igg2b (kappa) fab fragment complexed with antigen - mouse (mus muscul
   10   219 1crb    6  3.01  2.62 cellular retinol binding protein (crbp) complexed - rat (rattus rattu
   
 [...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

* when you're happy, quit the program

* now run DEJANA to sort out the hits you're really interested in, let it write them to a new O macro, and execute this macro from within O. The use of DEJANA is described elsewhere in this manual
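
The hit lists in this section are ordered by number of matched SSEs (most first), then by RMSD, then by score, as mentioned in the version history (970102). If you post-process hits yourself, the same ordering can be reproduced as sketched below. The tuple layout and the function name are choices made for this sketch; the four hits are copied from the Bones listing above.

```python
# (PDB code, nr of matched SSEs, RMSD, score), copied from the
# listing above but deliberately shuffled.
hits = [
    ("1pmp", 6, 2.20, 1.92),
    ("1igc", 6, 2.74, 2.42),
    ("1ftp", 7, 2.71, 2.26),
    ("1cbs", 6, 2.53, 2.05),
]


def sort_hits(hits):
    """Sort by nr of SSEs (descending), then RMSD, then score.

    Ascending order for RMSD and score assumes that lower values mean
    better hits, which is consistent with the listings in this section.
    """
    return sorted(hits, key=lambda h: (-h[1], h[2], h[3]))


ordered = [pdb for pdb, _, _, _ in sort_hits(hits)]
# ordered is ['1ftp', '1pmp', '1cbs', '1igc'], the order of hits 1-4
```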

6 USER INPUT FILES

6.1 description

In order to run DEJAVU you need a database file (which we provide) and a file which describes the SSEs of your protein. Here, we describe how you can make such a file yourself; later, we show how this process can be carried out completely automatically.

An (ASCII) input file consists of records which are all read in the format (A6,A) and which are supposed to contain (keyword, value) combinations. The only exception is the comment card, which has an exclamation mark ("!") in column 1 and may contain any text you like in the other columns. Comment cards are ignored when DEJAVU reads your file.

Keywords consist of 6 characters, but only the first THREE are really needed.

6.2 keywords

The important keywords are:

REMark - followed by any text; the text is printed when DEJAVU reads the file; may occur anywhere; note the difference with "!" cards

MOLecl - an identifier for the molecule, typically the PDB name which consists of four characters (we suggest you use four characters for your own proteins as well, although the name may be up to ten characters long); this record MUST precede all of the following records !!

NOTe - a description of your protein, its source, possibly model number etc.; this record is optional

PDBfil - the name of the PDB file (please use COMPLETE path names); optional

ENDmol - another optional card to flag the end of the description of your molecule; it will force DEJAVU to print a brief summary of what it has just read from your file; if you omit this record, no such information is printed

In between the PDBfil and the ENDmol cards come the records which describe your protein's SSEs, one card per SSE. Such a card must contain the TYPE of secondary structure as the keyword. Valid type names are defined at the start of the database. Now (and in the foreseeable future), the only allowed types are 'ALPHA ' and 'BETA ' (note the trailing spaces !). The rest of the line must contain (in FREE format) in the following order:

- the NAME of the SSE (e.g., 'A3' for the third alpha helix)
- the NAME of the first residue (e.g., 'B234' for residue nr 234 in chain B of your protein); these must be O-names if you want to use O for the least-squares analysis and the graphics
- the NAME of the last residue
- the NUMBER of residues
- the X,Y,Z coordinates of the Calpha atom of the first residue
- the X,Y,Z coordinates of the Calpha atom of the last residue
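
For readers who want to check or generate SSE files with their own scripts, the record layout described above can be parsed as sketched below. This is an independent illustration, not DEJAVU's own reader: it splits on whitespace rather than using the strict (A6,A) format, it stops at the first ENDmol card, and all names in it (SSE, parse_sse_file) are made up for this sketch.

```python
import shlex
from dataclasses import dataclass


@dataclass
class SSE:
    sse_type: str      # 'ALPHA' or 'BETA'
    name: str          # e.g. 'A3'
    first_res: str     # name of the first residue
    last_res: str      # name of the last residue
    nres: int          # number of residues
    first_ca: tuple    # x, y, z of the first Calpha
    last_ca: tuple     # x, y, z of the last Calpha


SSE_TYPES = {"ALP": "ALPHA", "BET": "BETA"}  # first three letters suffice


def parse_sse_file(text):
    """Parse one molecule from the text of a user SSE file."""
    mol = {"MOL": None, "NOTE": None, "PDB": None, "sses": []}
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue                      # blank line or comment card
        parts = line.split(None, 1)
        key = parts[0][:3].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "MOL":
            mol["MOL"] = value
        elif key == "NOT":
            mol["NOTE"] = value
        elif key == "PDB":
            mol["PDB"] = value
        elif key == "END":
            break
        elif key in SSE_TYPES:
            # shlex honours the quotes around names such as 'B1 '.
            f = shlex.split(value)
            xyz = [float(v) for v in f[4:10]]
            mol["sses"].append(SSE(SSE_TYPES[key], f[0].strip(),
                                   f[1].strip(), f[2].strip(), int(f[3]),
                                   tuple(xyz[:3]), tuple(xyz[3:])))
        # REMark and unknown cards are silently skipped in this sketch.
    return mol


sample = """! Mol 1cro
MOL   1cro
NOTE  cro repressor - bacteriophage (lamb
PDB   /nfs/public/pdb/cro1.pdb
!
BETA  'B1 ' 'O2' 'O5' 4 -14.281 -31.313 -18.167 -23.175 -35.450 -16.637
ALPHA 'A1 ' 'O7' 'O13' 7 -29.257 -34.194 -18.097 -28.845 -32.180 -7.967
ENDMOL
"""
mol = parse_sse_file(sample)
```

The whitespace split makes the sketch tolerant of files in which the keyword field is not padded to exactly six columns.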

6.3 example

The following example input file demonstrates the rules described above:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Fil cro1.secs
! Dat Tue Oct 27 16:10:38 1992
! Mol 1cro
!
MOL   1cro
NOTE  cro repressor - bacteriophage (lamb
PDB   /nfs/public/pdb/cro1.pdb
!
BETA  'B1 ' 'O2' 'O5' 4 -14.281 -31.313 -18.167 -23.175 -35.450 -16.637
ALPHA 'A1 ' 'O7' 'O13' 7 -29.257 -34.194 -18.097 -28.845 -32.180 -7.967
ALPHA 'A2 ' 'O16' 'O23' 8 -34.771 -27.785 -12.919 -28.824 -24.039 -20.669
ALPHA 'A3 ' 'O27' 'O36' 10 -37.998 -24.961 -17.921 -38.897 -38.362 -23.129
BETA  'B2 ' 'O39' 'O45' 7 -29.786 -38.963 -24.270 -15.878 -26.755 -18.342
BETA  'B3 ' 'O49' 'O56' 8 -19.552 -22.759 -18.208 -26.812 -40.941 -30.956
BETA  'B4 ' 'A2' 'A5' 4 -13.971 -31.869 -27.393 -5.357 -36.922 -28.490
ALPHA 'A4 ' 'A7' 'A13' 7 0.890 -35.709 -26.997 0.486 -34.944 -37.172
ALPHA 'A5 ' 'A16' 'A23' 8 7.112 -30.676 -32.685 0.941 -25.214 -25.866
ALPHA 'A6 ' 'A27' 'A36' 10 10.231 -27.335 -28.000 10.343 -40.059 -21.413
BETA  'B5 ' 'A39' 'A45' 7 1.183 -39.887 -20.169 -11.744 -27.270 -27.497
BETA  'B6 ' 'A49' 'A56' 8 -7.815 -23.996 -28.506 -2.038 -40.811 -13.598
BETA  'B7 ' 'A61' 'A64' 4 -0.515 -49.077 -6.661 7.429 -51.625 -0.395
BETA  'B8 ' 'B2' 'B5' 4 -9.695 -42.362 -23.899 -11.331 -37.554 -32.556
ALPHA 'A7 ' 'B7' 'B13' 7 -14.598 -38.849 -38.128 -5.003 -39.984 -40.092
ALPHA 'A8 ' 'B16' 'B23' 8 -11.330 -44.668 -45.288 -16.314 -48.999 -37.181
ALPHA 'A9 ' 'B27' 'B36' 10 -16.401 -47.176 -46.990 -22.870 -34.583 -45.529
BETA  'B9 ' 'B39' 'B45' 7 -20.900 -34.390 -36.358 -10.488 -46.927 -25.771
BETA  'B10 ' 'B49' 'B56' 8 -11.541 -50.660 -29.488 -25.975 -32.563 -31.906
BETA  'B11 ' 'C2' 'C5' 4 -19.072 -41.841 -20.389 -17.236 -36.377 -12.462
ALPHA 'A10 ' 'C7' 'C13' 7 -14.059 -37.036 -6.711 -23.682 -37.697 -4.432
ALPHA 'A11 ' 'C16' 'C23' 8 -17.641 -41.442 1.004 -12.536 -47.247 -6.179
ALPHA 'A12 ' 'C27' 'C36' 10 -12.708 -44.384 3.140 -5.894 -32.347 0.006
BETA  'B12 ' 'C39' 'C45' 7 -7.596 -33.295 -8.952 -18.764 -46.131 -18.226
BETA  'B13 ' 'C49' 'C56' 8 -18.195 -49.385 -14.312 -2.019 -32.415 -13.482
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The assignment of the SSEs, i.e., determining where helices and strands begin and end, can either be done by you, or within O (with the YASSPA option).

The above file, by the way, was extracted from the database by DEJAVU. It is used in some of the examples that are shown below, so if you want to rework the examples, you may want to extract this file as well (use the EXtract option in DEJAVU, then ask for molecule 1cro).

7 DATABASE

The database file (for those interested) consists of a number of 'TYPE ' cards, which define the secondary structure types that are in use, a number of entries in the same format as the user DEJAVU file, and (optionally) a 'CHAIN ' card which points to another database file (in this way you may chain your private database to your local database and from there on to the general PDB-derived database). Note that all records FOLLOWING a CHAIN card are IGNORED (i.e., it is NOT an INCLUDE statement !!!).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
REMARK
REMARK Secondary structure database
REMARK
(...)
REMARK Version 0.7 - Gerard Kleywegt @ 921103 - first Uppsala structures included
REMARK
REMARK === list of secondary structure types that are used in this database
REMARK
TYPE   'ALPHA'  'alpha helix'
TYPE   'BETA'   'beta strand'
REMARK
REMARK === PRIVATE STRUCTURES
(...)
REMARK
REMARK === GSTA; sec structure according to ALWYN !!! NOT YASSPA !!!
REMARK
MOL    GSTA
NOTE   human class alpha glutathione S-transferase model M10A
REMARK
BETA   'B1' 'A4' 'A7'   4   83.556  32.658  -4.327   85.981  34.524   4.814
ALPHA  'A1' 'A16' 'A25'  10   88.040  22.978   5.128   83.811  20.525  -8.112
(...)
BETA   'B5' 'A203' 'A205'   3   94.355  22.919   1.194   97.646  21.706   7.281
ALPHA  'A9' 'A209' 'A218'  10  100.424  25.314  18.933   90.509  36.091  17.098
ENDMOL
(...)
REMARK
REMARK === CHAIN TO NEXT FILE
REMARK
CHAIN /home/gerard/progs/secs/libs/uppsala.secs
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
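
The difference between CHAIN and an INCLUDE statement is easiest to see in a small sketch. The code below is not DEJAVU's reader; it only mimics the behaviour described above (follow the CHAIN card, ignore whatever comes after it in the same file). The function name, the read_lines callback and the loop guard are assumptions of this sketch, and the file names in the example are invented.

```python
def read_chained_database(first_name, read_lines, max_files=100):
    """Collect records from a database file and the files it chains to.

    read_lines(name) must return the lines of the file called `name`.
    Returns (records, visited_file_names).
    """
    records = []
    visited = []
    name = first_name
    while name is not None:
        if name in visited or len(visited) >= max_files:
            raise ValueError("circular or overly long CHAIN: " + name)
        visited.append(name)
        next_name = None
        for line in read_lines(name):
            parts = line.split(None, 1)
            if not parts:
                continue                  # skip blank lines
            if parts[0].upper() == "CHAIN":
                next_name = parts[1].strip() if len(parts) > 1 else None
                break                     # the rest of this file is ignored
            records.append(line.rstrip("\n"))
        name = next_name
    return records, visited


# In-memory stand-ins for two database files (hypothetical names).
files = {
    "private.lib": ["TYPE 'ALPHA' 'alpha helix'",
                    "MOL AAAA", "ENDMOL",
                    "CHAIN local.lib",
                    "MOL IGNORED", "ENDMOL"],
    "local.lib": ["MOL BBBB", "ENDMOL"],
}
records, visited = read_chained_database("private.lib", files.__getitem__)
```

In this example the entry that follows the CHAIN card in the first file never reaches `records`, whereas an INCLUDE-style mechanism would have returned to it after reading the second file.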

8 RUNNING THE PROGRAM

8.1 startup

When you start the program, you will see something like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 151 gerard rigel 21:42:26 progs/secs> DEJAVU
   
 *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU ***
   
 Version  - 921029/0.06
 By       - Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
   
 Started  - Thu Oct 29 21:57:05 1992
 User     - gerard
 Mode     - interactive
 Tty      - /dev/ttyq3
   
 *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU *** DEJAVU ***
   
 Max nr of database entries             : (       1000)
 Max nr of sec-struc elements per entry : (        150)
 Max nr of sec-struc types              : (         10)
   
 DEJAVU database file ? (secs.lib)
   
 List contents of database (Y/N) ? (N)
   
 TYPE   > ALPHA  alpha helix
 TYPE   > BETA   beta strand
 Nr of lines read  : (         94)
 Nr of entries now : (          3)
 CHAIN  > /home/gerard/progs/secs/libs/pdb.secs
   
 Nr of lines read : (      20356)
 Nr of entries    : (        605)
   
 +----------------------------------------------------------+
 | OPTIONS:                                                 |
 |                                                          |
 | REad user DEJAVU file       FInd user motif in database  |
 | LIst a database entry       EXtract a database entry     |
 | CHeck database integrity    STatistics                   |
 | QUit from DEJAVU            INcremental comparison       |
 | SElect certain entries      TOpological analysis         |
 | ! (comment; no action)      ? (list options)             |
 +----------------------------------------------------------+
   
 ===> Option ? (READ)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

You are asked to supply the name of the database file and whether or not you want a listing of the contents of the database (reply "NO" to this unless you want to see 20 kilolines of output running over your screen ...). The database(s) are then loaded and the number of entries (in this case, 605) is printed. You are then presented with a menu of options:

8.2 options

! = any input beginning with "!" is ignored (this allows you to include comments in input files or scripts)
? = will result in a renewed listing of the available options
QU = will stop the program
CH = not usually needed by end-users; it checks all entries to see if there are duplicate molecule identifiers or PDB file names (this takes some time !)
LI = lists all entries which contain a certain string in their molecule identifier, note or PDB file name; you may enter the string
EX = extracts an entry from the database in a suitable format so that this file can be used as a user input file to DEJAVU
RE = read a user DEJAVU file (must be done before one uses FI)
FI = searches for secondary structure motifs; this option is discussed in detail in the following section
IN = incremental search ("find as many common SSEs as possible"); experience has shown that this is the method of choice !!!

8.3 LIst

An example of the LIst option, in which all entries that have the word "dna" in their note are listed:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ)
li
 Search on Name, Comment or Filename ? (N)
com
 Search string ? (p2)
dna
   
 MOL    > 1dpi
 NOTE   > /dna$ polymerase i (klenow fragment) (e.c.2.7.7.7 - (escherichia $col
 PDB    > /nfs/public/pdb/dpi1.pdb
 Nr of elements : (         37)
 ====== >  Nr Type   Name   From   To     Nres
 ====== >   1 ALPHA  A1     336    348      13
 ====== >   2 BETA   B1     351    358       8
 ====== >   3 BETA   B2     370    375       6
 ====== >   4 BETA   B3     380    385       6
[...]
 ====== >  35 ALPHA  A20    890    905      16
 ====== >  36 BETA   B16    913    921       9
 ====== >  37 ALPHA  A21    924    927       4
   
 MOL    > 2gn5
 NOTE   > gene 5 /dna$ binding protein - filamentous bacteri
 PDB    > /nfs/public/pdb/gn52.pdb
 Nr of elements : (          7)
 ====== >  Nr Type   Name   From   To     Nres
 ====== >   1 ALPHA  A1     11     13        3
 ====== >   2 BETA   B1     15     19        5
 ====== >   3 BETA   B2     22     24        3
 ====== >   4 BETA   B3     26     38       13
 ====== >   5 BETA   B4     42     48        7
 ====== >   6 BETA   B5     60     62        3
 ====== >   7 BETA   B6     81     84        4
   
 ===> Option ? (LI)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Note that the "notes" for the PDB-derived entries were extracted by a dumb csh-script from the COMPND and SOURCE records of the corresponding PDB files; they have not been checked by hand and may therefore be rather incomplete !

8.4 EXtract

An example of the use of the EXtract option:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (LI)
extr
 Molecule name ? (dna)
2gn5
   
 MOL    > 2gn5
 NOTE   > gene 5 /dna$ binding protein - filamentous bacteri
 PDB    > /nfs/public/pdb/gn52.pdb
 Nr of elements : (          7)
 Filename ? (out.secs)
2gn5.secs
   
 ===> Option ? (EXTR)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Note that ALL entries which contain the string that you enter in their molecule identifier are written to files !
To show that this option really works, we show the resulting file:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 182 gerard rigel 19:04:41 progs/secs> cat 2gn5.secs
! Fil 2gn5.secs
! Dat Thu Oct 29 22:10:29 1992
! Mol 2gn5
!
MOL   2gn5
NOTE  gene 5 /dna$ binding protein - filamentous bacteri
PDB   /nfs/public/pdb/gn52.pdb
!
ALPHA 'A1 ' '11' '13' 3 9.884 15.253 22.042 8.967 11.131 19.406
BETA  'B1 ' '15' '19' 5 13.747 7.764 18.560 14.306 -3.922 13.856
BETA  'B2 ' '22' '24' 3 23.228 -7.564 9.436 22.766 -10.808 3.610
BETA  'B3 ' '26' '38' 13 18.044 -11.177 3.277 -3.221 15.221 11.399
BETA  'B4 ' '42' '48' 7 -3.554 14.308 15.412 10.385 3.316 9.016
BETA  'B5 ' '60' '62' 3 6.488 19.768 11.732 5.599 17.379 5.353
BETA  'B6 ' '81' '84' 4 7.108 8.400 4.546 10.457 17.825 5.205
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
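If you want to process such a file outside DEJAVU, the ALPHA and BETA records are easy to read. The following is a minimal Python sketch, not part of the DEJAVU package. It assumes that the six numbers at the end of each record are the start and end coordinates of the SSE, and the function name and dictionary keys are made up for this illustration.

```python
import math
import shlex


def parse_sse_line(line):
    """Parse one ALPHA/BETA record; return None for any other line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("!"):
        return None  # blank line or comment
    keyword = stripped.split(None, 1)[0]
    if keyword not in ("ALPHA", "BETA"):
        return None  # MOL, NOTE, PDB, ENDMOL, ...
    # shlex honours the single quotes around the name and residue ids
    fields = shlex.split(stripped)
    start = tuple(float(v) for v in fields[5:8])   # assumed: first end point
    end = tuple(float(v) for v in fields[8:11])    # assumed: second end point
    return {
        "type": keyword,
        "name": fields[1].strip(),
        "first": fields[2].strip(),   # residue ids stay strings (e.g. O16)
        "last": fields[3].strip(),
        "nres": int(fields[4]),
        "start": start,
        "end": end,
        "length": math.dist(start, end),
    }


record = parse_sse_line(
    "ALPHA 'A1 ' '11' '13' 3 9.884 15.253 22.042 8.967 11.131 19.406"
)
print(record["name"], record["nres"], round(record["length"], 3))
```

Residue identifiers are kept as text because they may carry a chain prefix, as in the 1cro example of the next section.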

8.5 REad

An example of the use of the REad option:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (LIST)
read
 User DEJAVU file ? (user.secs)
cro1.secs
   
 MOL    > 1cro
 NOTE   > cro repressor - bacteriophage (lamb
 PDB    > /nfs/public/pdb/cro1.pdb
 ENDMOL > 1cro
 Nr of elements : (         25)
 ====== >   1 BETA   B1     O2     O5        4
 ====== >   2 ALPHA  A1     O7     O13       7
 ====== >   3 ALPHA  A2     O16    O23       8
 ====== >   4 ALPHA  A3     O27    O36      10
 ====== >   5 BETA   B2     O39    O45       7
 ====== >   6 BETA   B3     O49    O56       8
 ====== >   7 BETA   B4     A2     A5        4
 ====== >   8 ALPHA  A4     A7     A13       7
 ====== >   9 ALPHA  A5     A16    A23       8
 ====== >  10 ALPHA  A6     A27    A36      10
 ====== >  11 BETA   B5     A39    A45       7
 ====== >  12 BETA   B6     A49    A56       8
 ====== >  13 BETA   B7     A61    A64       4
 ====== >  14 BETA   B8     B2     B5        4
 ====== >  15 ALPHA  A7     B7     B13       7
 ====== >  16 ALPHA  A8     B16    B23       8
 ====== >  17 ALPHA  A9     B27    B36      10
 ====== >  18 BETA   B9     B39    B45       7
 ====== >  19 BETA   B10    B49    B56       8
 ====== >  20 BETA   B11    C2     C5        4
 ====== >  21 ALPHA  A10    C7     C13       7
 ====== >  22 ALPHA  A11    C16    C23       8
 ====== >  23 ALPHA  A12    C27    C36      10
 ====== >  24 BETA   B12    C39    C45       7
 ====== >  25 BETA   B13    C49    C56       8
   
 Nr of lines read : (         34)
 Nr of elements   : (         25)
   
 ===> Option ? (READ)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

9 FINDING A MOTIF

Looking for a secondary structure motif is easy. Let's take the example we used above pertaining to lambda cro repressor. We will look for a very simple "motif" consisting only of the helix-(turn)-helix of the DNA-binding domain. Actually, since we can only look for alpha helices (and beta strands, of course) we will ignore the turn, but we will impose that any "hit" in the database must consist of two helices which are quite close together (i.e., the C-terminus of helix A2 must be close to the N-terminus of helix A3).

9.1 input

The output looks something like this (broken into small pieces and annotated):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (LI)
fi
   
 ********** NEW QUERY **********
   
 Elements : ( B1 A1 A2 A3 B2 B3 B4 A4 A5 A6 B5 B6 B7 B8 A7 A8 A9 B9 B10
 B11 A10 A11 A12 B12 B13)
 Nr of elements to match (0 = abort) ? (       2)
2
 Query element   1 ? ( A4)
A2
 Query element   2 ? ( A5)
A3
 ................... ( A2 A3)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

9.2 SSEs

DEJAVU prints a list of the SSEs in your protein and wants to know how many SSEs make up your query motif. Next, you enter their names one by one (names are case-sensitive; spaces are removed by the program).

9.3 search criteria

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Mismatch nr of residues ? (          3)
2
 Mismatch element length ? (  10.000)
6
 Mismatch distances      ? (   5.000)
3
 Mismatch cosines        ? (   0.150)
.1
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Subsequently, the mismatch criteria must be entered. The first two are used for finding possible matching SSEs in database structures, the latter two for finding motifs of SSEs that have similar mutual distances and direction-vector cosines.

NOTE: from version 4.3 onward, the "mismatch nr of residues" has been replaced by *two* separate criteria, one which tells how many residues SSEs in the database proteins may be too short, and another which tells how many residues SSEs in the database proteins may be too long. This is especially useful when you use SSEs based on Bones; e.g., you found 6 residues in a helix but cannot exclude that the helix might be longer. In that case, use a "too short" cut-off of 1 or 2 residues, but a "too long" cut-off of 4 or even more residues.
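The asymmetric residue criterion described in the note can be pictured as follows. This is only a sketch in Python with invented names; whether the program treats the cut-offs as inclusive is an assumption made here.

```python
def nres_ok(query_nres, db_nres, too_short, too_long):
    """True if the database SSE is within the allowed residue-count window.

    too_short: how many residues the database SSE may be shorter than the query
    too_long:  how many residues the database SSE may be longer than the query
    Both limits are treated as inclusive (an assumption).
    """
    difference = db_nres - query_nres
    return -too_short <= difference <= too_long


# A 6-residue helix traced in Bones, which might really be longer:
print(nres_ok(6, 5, too_short=1, too_long=4))   # one residue shorter
print(nres_ok(6, 10, too_short=1, too_long=4))  # four residues longer
print(nres_ok(6, 11, too_short=1, too_long=4))  # five residues longer
```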

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Possible distance criteria:
  C  => centre-to-centre
  H  => MIN head-tail and tail-head (anti-parallel)
  T  => MIN head-head and tail-tail (parallel)
 Which distances (C/H/T) ? (H)
   
 Extensive output        ? (N)
no
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

You must decide what type of distance criterion to use. If you have a purely anti-parallel motif, you may use option "H" which compares C-term-to-N-term distances; if you have a purely parallel motif, you are better off if you use option "T" (the shorter of the N-term-to-N-term and the C-term-to-C-term distances is used).
If you have a mixed motif or all SSEs are criss-cross, then it's safest to use option "C" (centre-to-centre).
In addition, you may request extensive output, but you must be suicidal if you reply "YES" to this question !!
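To make the three distance types concrete, here is a small Python sketch. It is an illustration only: it assumes each SSE is given as a (head, tail) pair of coordinates, with the head at the N-terminal end, and the function name is invented.

```python
from math import dist


def sse_distance(a, b, kind):
    """Distance between two SSEs, each a (head, tail) pair of xyz tuples."""
    (a_head, a_tail), (b_head, b_tail) = a, b
    if kind == "C":  # centre-to-centre
        a_mid = tuple((h + t) / 2 for h, t in zip(a_head, a_tail))
        b_mid = tuple((h + t) / 2 for h, t in zip(b_head, b_tail))
        return dist(a_mid, b_mid)
    if kind == "H":  # MIN head-tail and tail-head (anti-parallel)
        return min(dist(a_head, b_tail), dist(a_tail, b_head))
    if kind == "T":  # MIN head-head and tail-tail (parallel)
        return min(dist(a_head, b_head), dist(a_tail, b_tail))
    raise ValueError("kind must be C, H or T")


# Two anti-parallel elements, 5 A apart and 10 A long:
up = ((0.0, 0.0, 0.0), (0.0, 0.0, 10.0))
down = ((5.0, 0.0, 10.0), (5.0, 0.0, 0.0))
for kind in "CHT":
    print(kind, round(sse_distance(up, down, kind), 2))
```

For this anti-parallel pair the "H" and "C" distances are both 5 A, whereas "T" gives the much longer diagonal, which is why "T" is a poor choice for anti-parallel motifs.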

9.4 search constraints and O macro

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Conserve directionality ? (Y)
   
 Conserve absolute motif ? (Y)
   
 Conserve neighbours     ? (Y)
   
 Create "O" macro file   ? (Y)
   
 "O" macro file          ? (lsq.omac)
cro_lsq.omac
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The last four input items pertain to:

(1) conservation of directionality: what this boils down to is that if you say "YES" you make sure that all elements are similarly oriented. What the program does is to sort the query elements from N-term to C-term and to make sure that the matching elements of a "hit" are also ordered from N-term to C-term. In addition, the actual cosines -rather than their absolute values- are checked. If you don't use this option, you might, for example, also find that helices A3 and A2 (in THAT order) of 1cro match your query, which is fine except that they run in the wrong direction (namely, from C-term to N-term)

(2) conservation of the absolute motif, or merely of the relative one: if you say "YES", then ALL the inter-SSE distances and cosines must satisfy the corresponding mismatch criteria; if you say "NO", then they must only hold for SUBSEQUENT SSEs (i.e., the distance from SSE nr 3 to nr 2 must be okay, but that from 3 to 1 doesn't matter, etc.). For example, if you are looking for a large beta-sheet, but you are also interested in beta-barrels made up of strands similar to those in your protein, then don't impose the absolute motif

(3) conservation of neighbours: if you say "YES" here, it merely means that if two elements are neighbours in your structure, then they must also be neighbours in the database structures. This is a rather strict criterion, and it's probably the first you want to relax if you don't find any (or enough) hits

(4) if you want, you can get an O macro file which will do some amazing tricks for you (see later) !!
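Items (1) and (2) can be summarised in a few lines of Python. This is a sketch of the idea as described above, with invented function names, and not the program's actual code.

```python
from math import sqrt


def direction_cosine(a, b):
    """Cosine of the angle between two SSEs given as (start, end) xyz pairs."""
    va = [e - s for s, e in zip(*a)]
    vb = [e - s for s, e in zip(*b)]
    dot = sum(p * q for p, q in zip(va, vb))
    norm = sqrt(sum(p * p for p in va)) * sqrt(sum(q * q for q in vb))
    return dot / norm


def cosines_match(query_cos, hit_cos, max_mismatch, conserve_directionality):
    """Compare actual cosines, or only their absolute values."""
    if not conserve_directionality:
        query_cos, hit_cos = abs(query_cos), abs(hit_cos)
    return abs(query_cos - hit_cos) <= max_mismatch


def pairs_to_check(n_elements, absolute_motif):
    """All pairs for an absolute motif, otherwise only subsequent SSEs."""
    if absolute_motif:
        return [(i, j) for i in range(n_elements) for j in range(i + 1, n_elements)]
    return [(i, i + 1) for i in range(n_elements - 1)]


print(pairs_to_check(3, absolute_motif=True))
print(pairs_to_check(3, absolute_motif=False))
```

With three query elements, the absolute motif checks the pairs 1-2, 1-3 and 2-3, whereas the relative motif only checks 1-2 and 2-3.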

9.5 output

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of elements recognised in query : (       2)
 Indices : (       3        4)
 Nr of elements of each type : (       2        0)
   
 ********** 1cro       **********
 [cro repressor - bacteriophage (lamb                                   ]
 [/nfs/public/pdb/cro1.pdb                                              ]
 QUERY    : (       3        4)
 Elements :    A2       A3
 Lengths  : (  10.462   14.405)
 Residues : (       8       10)
   
 MATCH    : (       3        4)
 Elements :    A2       A3
 Lengths  : (  10.462   14.405)
 Residues : (       8       10)
 Length   ... rmsd =      0.000 ... match =      1.000
 Residues ... rmsd =      0.000 ... match =      1.000
 Distance ... rmsd =      0.000 ... match =      1.000
 Cosines  ... rmsd =      0.000 ... match =      1.000
 SCORE : (   0.000)
   
 MATCH    : (       9       10)
 Elements :    A5       A6
 Lengths  : (  10.696   14.328)
 Residues : (       8       10)
 Length   ... rmsd =      0.174 ... match =      1.000
 Residues ... rmsd =      0.000 ... match =      1.000
 Distance ... rmsd =      0.144 ... match =      1.000
 Cosines  ... rmsd =      0.064 ... match =      1.000
 SCORE : (   0.383)
   
 MATCH    : (      16       17)
 Elements :    A8       A9
 Lengths  : (  10.456   14.233)
 Residues : (       8       10)
 Length   ... rmsd =      0.122 ... match =      1.000
 Residues ... rmsd =      0.000 ... match =      1.000
 Distance ... rmsd =      0.356 ... match =      1.000
 Cosines  ... rmsd =      0.030 ... match =      1.000
 SCORE : (   0.509)
   
 MATCH    : (      22       23)
 Elements :    A11      A12
 Lengths  : (  10.552   14.182)
 Residues : (       8       10)
 Length   ... rmsd =      0.170 ... match =      1.000
 Residues ... rmsd =      0.000 ... match =      1.000
 Distance ... rmsd =      0.129 ... match =      1.000
 Cosines  ... rmsd =      0.017 ... match =      1.000
 SCORE : (   0.316)
 Nr of best match : (       1)
 Best score       : (   0.000)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

9.6 algorithm

The program prints the SSEs it's going to look for and starts scanning the database. For each entry in the database, DEJAVU does the following:

(1) are there enough SSEs ?

(2) are there enough SSEs of each type (alpha, beta) ?

(3) find all possibly matching SSEs in the database structure for ALL of the elements in the query; if there aren't any for even one of the query elements, the database structure is skipped. Matching occurs by comparing type, number of residues and length of the SSEs

(4) ALL possible combinations of matching SSEs in the query and the database entry are generated which completely satisfy ALL criteria outlined earlier (distances, cosines, absolute or relative motif, directionality and neighbours)

(5) all the hits are printed and compared with the query; the matching SSEs are listed and some RMS-deviations are computed (don't worry about the match factors in the output); these are all combined into a final score; the score is 0.0 for a perfect match (see A2-A3 above which is identical to the query); the higher the score, the poorer the match

(6) for each protein which produced hits, the one with the lowest score is used to create some O instructions in the O macro file; in the example above, 1cro itself produced 4 very good hits because there are four monomers in the PDB file; note that the motif we are looking for scores 0.00
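Steps (1) to (4) can be sketched in Python as follows. This is a deliberately simplified illustration and not the DEJAVU source: it uses a single residue mismatch, only centre-to-centre distances with the absolute motif, and it leaves out the cosine, directionality and neighbour tests. All names and the dictionary layout are assumptions.

```python
from collections import Counter
from itertools import product
from math import dist


def length(sse):
    return dist(sse["start"], sse["end"])


def centre(sse):
    return tuple((s + e) / 2 for s, e in zip(sse["start"], sse["end"]))


def find_hits(query, entry, max_dres, max_dlen, max_ddist):
    """Return tuples of indices into entry that match the query elements."""
    # (1) are there enough SSEs ?
    if len(entry) < len(query):
        return []
    # (2) are there enough SSEs of each type ?
    need = Counter(s["type"] for s in query)
    have = Counter(s["type"] for s in entry)
    if any(have[t] < n for t, n in need.items()):
        return []
    # (3) possible matches for every query element (type, residues, length)
    candidates = []
    for q in query:
        ok = [i for i, s in enumerate(entry)
              if s["type"] == q["type"]
              and abs(s["nres"] - q["nres"]) <= max_dres
              and abs(length(s) - length(q)) <= max_dlen]
        if not ok:
            return []  # one query element without a partner: skip the entry
        candidates.append(ok)
    # (4) all combinations that satisfy the distance criterion for ALL pairs
    n = len(query)
    hits = []
    for combo in product(*candidates):
        if len(set(combo)) < n:
            continue  # the same database SSE cannot be used twice
        if all(abs(dist(centre(query[i]), centre(query[j]))
                   - dist(centre(entry[combo[i]]), centre(entry[combo[j]])))
               <= max_ddist
               for i in range(n) for j in range(i + 1, n)):
            hits.append(combo)
    return hits
```

Generating all combinations is the expensive part, which is one reason why strict criteria in step (3) keep the search fast.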

9.7 more hits

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ********** 1lap       **********
 [leucine aminopeptidase (e.c.3.4.11.1) - bovine (bos $taurus           ]
 [/nfs/public/pdb/lap1.pdb                                              ]
 QUERY    : (       3        4)
 Elements :    A2       A3
 Lengths  : (  10.462   14.405)
 Residues : (       8       10)
   
 MATCH    : (      31       32)
 Elements :    A16      A17
 Lengths  : (   9.916   17.758)
 Residues : (       7       12)
 Length   ... rmsd =      2.402 ... match =      0.993
 Residues ... rmsd =      1.581 ... match =      0.989
 Distance ... rmsd =      0.797 ... match =      1.000
 Cosines  ... rmsd =      0.033 ... match =      1.000
 SCORE : (   4.864)
 Nr of best match : (       1)
 Best score       : (   4.864)
   
 ********** 1trc       **********
 [calmodulin (/tr=2=c$ fragment comprising residues - bull (bos $taurus]
 [/nfs/public/pdb/trc1.pdb                                              ]
 QUERY    : (       3        4)
 Elements :    A2       A3
 Lengths  : (  10.462   14.405)
 Residues : (       8       10)
   
 MATCH    : (       4        5)
 Elements :    A3       A4
 Lengths  : (   9.351   14.741)
 Residues : (       8       10)
 Length   ... rmsd =      0.821 ... match =      0.998
 Residues ... rmsd =      0.000 ... match =      1.000
 Distance ... rmsd =      0.187 ... match =      1.000
 Cosines  ... rmsd =      0.005 ... match =      1.000
 SCORE : (   1.016)
 Nr of best match : (       1)
 Best score       : (   1.016)
   
 ===> Option ? (FI)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

So, we found "hits" with three different proteins. In this case, we used rather strict criteria in order to restrict the output a bit; if you relax the criteria somewhat, you get many more hits.
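The "Length" and "Residues" rmsd values in the output above are consistent with a plain root-mean-square deviation over the matched elements. The Python sketch below checks this for the 1lap hit using only the numbers printed above. This reading is inferred from the output, not taken from the program source, and no attempt is made here to reproduce how the values are combined into the final SCORE.

```python
from math import sqrt


def rmsd(query_values, match_values):
    """Root-mean-square deviation between paired values."""
    squared = [(q - m) ** 2 for q, m in zip(query_values, match_values)]
    return sqrt(sum(squared) / len(squared))


# 1cro query A2, A3 versus 1lap match A16, A17 (values from the output above)
print(f"{rmsd((10.462, 14.405), (9.916, 17.758)):.3f}")  # Length   -> 2.402
print(f"{rmsd((8, 10), (7, 12)):.3f}")                    # Residues -> 1.581
```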

10 DEJANA

If you have coordinates for your search model (at least CA atoms), and if you have the PDB files of the hits on a local disk, you are strongly advised to run LSQMAN first, and to use DEJANA to screen the O macro produced by LSQMAN.

Otherwise, you can use DEJANA directly on the O macro produced by DEJAVU. DEJANA reads a DEJAVU or LSQMAN O macro, and allows you to apply cut-offs to get rid of unwanted (poor) hits.

For example, in case of a Bones search, the program can be used directly on the O macro produced by DEJAVU:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 274 gerard sarek 18:14:59 gerard/junk > run dejana
[...]
 Name of O macro (from DEJAVU or LSQMAN) ? (lsqman.omac)
lsq.omac
 Reading hits ...
 #  1 ID 1acy Nres 6 RMSD 4.08 A
 #  2 ID 1baf Nres 6 RMSD 4.10 A
[...]
 # 54 ID 7tim Nres 6 RMSD 3.67 A
 Nr of hits (> 0 residues/SSEs) : ( 54)
 ------------------------------------------
 Min nr of matched residues/SSEs ? ( 1)
 Max RMSD of matched residues/SSEs ? ( 999.990)
 Sorting hits ...
 Nr of hits left : ( 54)
 #  1 ID 1ftp Nres 7 RMSD 2.71 A
 #  2 ID 1pmp Nres 6 RMSD 2.20 A
 #  3 ID 1cbs Nres 6 RMSD 2.53 A
 #  4 ID 1igc Nres 6 RMSD 2.74 A
 #  5 ID 1fbi Nres 6 RMSD 2.86 A
[...]
 # 54 ID 1for Nres 6 RMSD 5.90 A
 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 Option (0, 1, 2) ? ( 0)
 ------------------------------------------
 Min nr of matched residues/SSEs ? ( 1)
6
 Max RMSD of matched residues/SSEs ? ( 999.990)
3.5
 Sorting hits ...
 Nr of hits left : ( 19)
 #  1 ID 1ftp Nres 7 RMSD 2.71 A
 #  2 ID 1pmp Nres 6 RMSD 2.20 A
 #  3 ID 1cbs Nres 6 RMSD 2.53 A
 #  4 ID 1igc Nres 6 RMSD 2.74 A
 #  5 ID 1fbi Nres 6 RMSD 2.86 A
 #  6 ID 1lid Nres 6 RMSD 2.88 A
 #  7 ID 1mdc Nres 6 RMSD 2.93 A
 #  8 ID 1hmt Nres 6 RMSD 2.94 A
 #  9 ID 2cgr Nres 6 RMSD 3.01 A
 # 10 ID 1crb Nres 6 RMSD 3.01 A
 # 11 ID 1iai Nres 6 RMSD 3.03 A
 # 12 ID 1rmf Nres 6 RMSD 3.03 A
 # 13 ID 1svb Nres 6 RMSD 3.05 A
 # 14 ID 1bbj Nres 6 RMSD 3.11 A
 # 15 ID 1opb Nres 6 RMSD 3.14 A
 # 16 ID 1eap Nres 6 RMSD 3.21 A
 # 17 ID 1mcp Nres 6 RMSD 3.23 A
 # 18 ID 1tet Nres 6 RMSD 3.31 A
 # 19 ID 1dbb Nres 6 RMSD 3.45 A
 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 Option (0, 1, 2) ? ( 0)
1
 New O macro file ? (dejana.omac)
dejana_bones.omac
 Writing hits ...
 Processing PDB code : (1ftp)
 Processing PDB code : (1pmp)
[...]
 New O macro written ...
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Example of a case where coordinates were used:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 274 gerard sarek 18:14:59 gerard/junk > run dejana
[...]
 Maximum number of hits : ( 2500)
 Name of O macro (from DEJAVU or LSQMAN) ? (lsqman.omac)
lsq_crab.omac
 Reading hits ...
 #  1 ID 1ACY Nres 26 RMSD 1.99 A
 #  2 ID 1AMP Nres 16 RMSD 3.45 A
[...]
 # 52 ID 8FAB Nres 16 RMSD 2.14 A
 Nr of hits (> 0 residues/SSEs) : ( 52)
 ------------------------------------------
 Min nr of matched residues/SSEs ? ( 1)
 Max RMSD of matched residues/SSEs ? ( 999.990)
 Sorting hits ...
 Nr of hits left : ( 52)
 #  1 ID 1CBS Nres 137 RMSD 0.00 A
 #  2 ID 1CBI Nres 130 RMSD 0.86 A
 #  3 ID 1OPB Nres 123 RMSD 1.35 A
 #  4 ID 1CRB Nres 123 RMSD 1.36 A
 #  5 ID 1HMT Nres 121 RMSD 1.36 A
 #  6 ID 1LID Nres 120 RMSD 1.44 A
 #  7 ID 1FTP Nres 120 RMSD 1.69 A
 #  8 ID 1PMP Nres 119 RMSD 1.37 A
 #  9 ID 1MDC Nres 105 RMSD 2.06 A
 # 10 ID 1EPA Nres 66 RMSD 1.97 A
 # 11 ID 1NSN Nres 43 RMSD 2.64 A
[...]
 # 51 ID 1NMB Nres 8 RMSD 1.79 A
 # 52 ID 7FAB Nres 5 RMSD 0.44 A
 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 Option (0, 1, 2) ? ( 0)
0
 ------------------------------------------
 Min nr of matched residues/SSEs ? ( 1)
100
 Max RMSD of matched residues/SSEs ? ( 999.990)
3
 Sorting hits ...
 Nr of hits left : ( 9)
 #  1 ID 1CBS Nres 137 RMSD 0.00 A
 #  2 ID 1CBI Nres 130 RMSD 0.86 A
 #  3 ID 1OPB Nres 123 RMSD 1.35 A
 #  4 ID 1CRB Nres 123 RMSD 1.36 A
 #  5 ID 1HMT Nres 121 RMSD 1.36 A
 #  6 ID 1LID Nres 120 RMSD 1.44 A
 #  7 ID 1FTP Nres 120 RMSD 1.69 A
 #  8 ID 1PMP Nres 119 RMSD 1.37 A
 #  9 ID 1MDC Nres 105 RMSD 2.06 A
 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 Option (0, 1, 2) ? ( 0)
1
 New O macro file ? (dejana.omac)
dejana_crab.omac
 Writing hits ...
 Processing PDB code : (1CBS)
 Processing PDB code : (1CBI)
 Processing PDB code : (1OPB)
 Processing PDB code : (1CRB)
 Processing PDB code : (1HMT)
 Processing PDB code : (1LID)
 Processing PDB code : (1FTP)
 Processing PDB code : (1PMP)
 Processing PDB code : (1MDC)
 New O macro written ...
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
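The screening that DEJANA performs in these examples amounts to a filter followed by a sort. The Python sketch below mimics that behaviour on a handful of hits copied from the Bones example. The sort order (most matched residues first, then lowest RMSD) is inferred from the listings above, and treating the RMSD cut-off as inclusive is an assumption; the function name is invented.

```python
def screen_hits(hits, min_nres=1, max_rmsd=999.99):
    """Keep hits that pass both cut-offs, best ones first.

    hits: list of (pdb_id, nres, rmsd) tuples
    """
    kept = [h for h in hits if h[1] >= min_nres and h[2] <= max_rmsd]
    # most matched residues/SSEs first; ties broken by lowest RMSD
    return sorted(kept, key=lambda h: (-h[1], h[2]))


hits = [
    ("1acy", 6, 4.08),
    ("1cbs", 6, 2.53),
    ("1ftp", 7, 2.71),
    ("1pmp", 6, 2.20),
    ("7tim", 6, 3.67),
]
for pdb_id, nres, rmsd in screen_hits(hits, min_nres=6, max_rmsd=3.5):
    print(f"ID {pdb_id} Nres {nres} RMSD {rmsd:.2f} A")
```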

11 ANALYSING THE RESULTS

11.1 O macro

NOTE: from version 5.0 onwards, one would use the accompanying program DEJANA to sort out the hits, and save only the most promising ones to a new O macro.

Analysing and evaluating the "hits" is best done in O. The previous example resulted in the following O macro:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 187 gerard rigel 19:04:41 progs/secs> cat cro_lsq.omac
! "O" macro cro_lsq.omac
! created by DEJAVU                 at Thu Oct 29 22:27:18 1992
!
print ... analysing 1cro
print cro repressor - bacteriophage (lamb
print ... query  A2     A3
print ... allowed mismatches 2 6.000 3.000 0.100
print ... distance type H
print ... directionality Y
print ... absolute motif Y
print ... neighbours Y
!
s_a_i /nfs/public/pdb/cro1.pdb 1cro
mol 1cro obj c1cro
pai_zo 1cro ; yellow
pai_zo 1cro O16    O23    green
pai_zo 1cro O27    O36    green
ca ; end
cent_id term_id 1cro O16    CA ;
!
db_set_dat .lsq_integer 1 1 50
db_set_dat .lsq_integer 2 4 4
db_set_dat .lsq_integer 3 3 16999999
!
o_setup off off on
!
!
print ... comparing 1cro
print cro repressor - bacteriophage (lamb
print ... score = 0.0000000E+00
!
s_a_i /nfs/public/pdb/cro1.pdb 1cro pdb
!
lsq_expl 1cro 1cro
O16    O23    CA
O16
O27    O36    CA
O27
; 1cro_to_1cro
!
lsq_impr 1cro_to_1cro 1cro ; 1cro ; CA 1cro_to_1cro
!
lsq_mol 1cro_to_1cro 1cro ;
!
mol 1cro obj c1cro
pai_zo 1cro ; blue
pai_zo 1cro O16    O23    red
pai_zo 1cro O27    O36    red
ca ; end
!
!
print ... comparing 1lap
print leucine aminopeptidase (e.c.3.4.11.1) - bovine (bos $taurus
print ... score = 4.864332
!
s_a_i /nfs/public/pdb/lap1.pdb 1lap pdb
!
lsq_expl 1cro 1lap
O16    O23    CA
404
O27    O36    CA
428
; 1lap_to_1cro
!
lsq_impr 1lap_to_1cro 1cro ; 1lap ; CA 1lap_to_1cro
!
lsq_mol 1lap_to_1cro 1lap ;
!
mol 1lap obj c1lap
pai_zo 1lap ; blue
pai_zo 1lap 404    410    red
pai_zo 1lap 428    439    red
ca ; end
!
!
print ... comparing 1trc
print calmodulin (/tr=2=c$ fragment comprising residues - bull (bos $taurus
print ... score = 1.016416
!
s_a_i /nfs/public/pdb/trc1.pdb 1trc pdb
!
lsq_expl 1cro 1trc
O16    O23    CA
A103
O27    O36    CA
A118
; 1trc_to_1cro
!
lsq_impr 1trc_to_1cro 1cro ; 1trc ; CA 1trc_to_1cro
!
lsq_mol 1trc_to_1cro 1trc ;
!
mol 1trc obj c1trc
pai_zo 1trc ; blue
pai_zo 1trc A103   A110   red
pai_zo 1trc A118   A127   red
ca ; end
!
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

11.2 running O

Let's run O and execute this macro (the output of the fitting of 1cro onto itself has been omitted):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 190 gerard rigel 23:08:50 secs/database> 4d_ono general.o
  O > Use of this program implies acceptance of conditions
  O > described in Appendix 10 of the O manual
  O > O version 5.8, Sat Sep 26 13:59:06 MET 1992
  O > Loading general.o
  O > Maximum inter-residue link distance = 6.00
  O >  There were   23 residues.
  O >              113 atoms.
  O > Do you want to use the display? [Yes]:
  O > Graphics board GL4DXG-4.0
  O >   O >  trackball on (F7KEY)
  O >  trackball off (F7KEY)
@cro_lsq.omac
  O > Macro in computer file-system.
 As4> ... analysing 1cro
  O >  As4> cro repressor - bacteriophage (lamb
  O >  As4> ... query  A2     A3
  O >  As4> ... allowed mismatches 2 6.000 3.000 0.100
  O >  As4> ... distance type H
  O >  As4> ... directionality Y
  O >  As4> ... absolute motif Y
  O >  As4> ... neighbours Y
  O >   O >  Sam> File type is PDB
 Sam>  Database compressed.
 Sam> Molecule 1CRO contained 264 residues and 264 atoms
  O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >
  O >   O >  As4> ... comparing 1cro
[...]
  O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >
 As4> ... comparing 1lap
  O >  As4> leucine aminopeptidase (e.c.3.4.11.1) - bovine (bos $taurus
  O >  As4> ... score = 4.864332
  O >   O >  Sam> File type is PDB
 Sam>  Database compressed.
 Sam> Molecule 1LAP contained 483 residues and 4491 atoms
  O >  PDB          is not a visible command.
  O >   O >  Lsq > Now define what atoms in A [=1CRO] are to be matched to B [=1LAP]
 Lsq > Defining 3 names in 1CRO implies a zone and an atom name.
 Lsq > Defining 2 names in 1CRO implies a zone and CA atoms.
 Lsq > Defining 1 name in 1CRO implies the CA of that residue.
 Lsq > Molecule 1LAP just requires the start residue and atom name.
 Lsq > A blank line terminates input.
 Lsq > Define atoms from 1CRO (the not rotated molecule):  Lsq > Define atoms
 from 1LAP (the rotated molecule):  Lsq > Define atoms from 1CRO (the not rotated
 molecule):  Lsq > Define atoms from 1LAP (the rotated molecule):  Lsq > Define
 atoms from 1CRO (the not rotated molecule):  Lsq > The 18 atoms have an r.m.s.
 fit of 5.768
 Lsq >  xyz(1) =     0.9571*x+    0.1367*y+   -0.2555*z+ -112.0573
 Lsq >  xyz(2) =     0.2552*x+    0.0197*y+    0.9667*z+  -70.0792
 Lsq >  xyz(3) =     0.1371*x+   -0.9904*y+   -0.0160*z+   33.9509
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_  O >   O >  Lsq > Least
 squares match by Semi Automatic Alignment.
 Lsq > What is the name of molecule B [1LAP  ]?  Lsq > Number of atoms in A/B
 to look for alignment   264  481
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of     8 residues located.
 Lsq >  Loop =    1 ,r.m.s. fit =     0.346 with     8 atoms
 Lsq >  x(1) =     0.9335*x+   -0.2296*y+    0.2756*z+  -97.8013
 Lsq >  x(2) =    -0.3366*x+   -0.2957*y+    0.8940*z+   -6.6633
 Lsq >  x(3) =    -0.1238*x+   -0.9273*y+   -0.3533*z+   54.2608
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    14 residues located.
 Lsq >  Loop =    2 ,r.m.s. fit =     2.143 with    14 atoms
 Lsq >  x(1) =     0.1328*x+   -0.9509*y+   -0.2794*z+   18.4068
 Lsq >  x(2) =    -0.2737*x+   -0.3061*y+    0.9118*z+   -9.3083
 Lsq >  x(3) =    -0.9526*x+   -0.0446*y+   -0.3009*z+   58.7248
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    15 residues located.
 Lsq > A fragment of     6 residues located.
 Lsq >  Loop =    3 ,r.m.s. fit =     2.612 with    21 atoms
 Lsq >  x(1) =     0.0871*x+   -0.9605*y+   -0.2645*z+   22.0105
 Lsq >  x(2) =    -0.2722*x+   -0.2783*y+    0.9211*z+  -11.2710
 Lsq >  x(3) =    -0.9583*x+   -0.0082*y+   -0.2857*z+   56.8081
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    15 residues located.
 Lsq > A fragment of     6 residues located.
 Lsq >  Loop =    4 ,r.m.s. fit =     2.612 with    21 atoms
 Lsq >  x(1) =     0.0871*x+   -0.9605*y+   -0.2645*z+   22.0105
 Lsq >  x(2) =    -0.2722*x+   -0.2783*y+    0.9211*z+  -11.2710
 Lsq >  x(3) =    -0.9583*x+   -0.0082*y+   -0.2857*z+   56.8081
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_ Lsq > Here are the fragments
 used in the alignment
 Lsq > 0   O23 LGVYQSAINKAIHAG    O37
 Lsq >     425 RSAGACTAAAFLKEF    439
 Lsq > 0   O39 KIFLTI    O44
 Lsq >     326 IQVDNT    331
  O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >  As4> ... comparing
 1trc
  O >  As4> calmodulin (/tr=2=c$ fragment comprising residues - bull (bos $tau
  O >  As4> ... score = 1.016416
  O >   O >  Sam> File type is PDB
 Sam>  Database compressed.
 Sam> Molecule 1TRC contained 140 residues and 1089 atoms
  O >  PDB          is not a visible command.
  O >   O >  Lsq > Now define what atoms in A [=1CRO] are to be matched to B [=1TRC]
 Lsq > Defining 3 names in 1CRO implies a zone and an atom name.
 Lsq > Defining 2 names in 1CRO implies a zone and CA atoms.
 Lsq > Defining 1 name in 1CRO implies the CA of that residue.
 Lsq > Molecule 1TRC just requires the start residue and atom name.
 Lsq > A blank line terminates input.
 Lsq > Define atoms from 1CRO (the not rotated molecule):  Lsq > Define atoms from
 1TRC (the rotated molecule):  Lsq > Define atoms from 1CRO (the not rotated molecule):
  Lsq > Define atoms from 1TRC (the rotated molecule):  Lsq > Define atoms from 1CRO
 (the not rotated molecule):  Lsq > The 18 atoms have an r.m.s. fit of 2.956
 Lsq >  xyz(1) =     0.0832*x+   -0.6134*y+   -0.7854*z+   62.0348
 Lsq >  xyz(2) =     0.5658*x+    0.6778*y+   -0.4695*z+  -22.2287
 Lsq >  xyz(3) =     0.8204*x+   -0.4053*y+    0.4034*z+  -91.4498
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_  O >   O >  Lsq > Least squares
 match by Semi Automatic Alignment.
 Lsq > What is the name of molecule B [1TRC  ]?  Lsq > Number of atoms in A/B to look
 for alignment   264  140
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    15 residues located.
 Lsq > A fragment of    10 residues located.
 Lsq >  Loop =    1 ,r.m.s. fit =     2.363 with    25 atoms
 Lsq >  x(1) =     0.1272*x+   -0.5979*y+   -0.7914*z+   60.8691
 Lsq >  x(2) =     0.6057*x+    0.6787*y+   -0.4153*z+  -29.7156
 Lsq >  x(3) =     0.7854*x+   -0.4266*y+    0.4485*z+  -93.8586
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    15 residues located.
 Lsq > A fragment of    10 residues located.
 Lsq >  Loop =    2 ,r.m.s. fit =     2.363 with    25 atoms
 Lsq >  x(1) =     0.1272*x+   -0.5979*y+   -0.7914*z+   60.8691
 Lsq >  x(2) =     0.6057*x+    0.6787*y+   -0.4153*z+  -29.7156
 Lsq >  x(3) =     0.7854*x+   -0.4266*y+    0.4485*z+  -93.8586
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_ Lsq > Here are the fragments
 used in the alignment
 Lsq > 0   O13 RFGQTKTAKD    O22
 Lsq >     A99 YISAAELRHV   A108
 Lsq > 0   O23 LGVYQSAINKAIHAG    O37
 Lsq >    A114 EKLTDEEVDEMIREA   A128
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
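The three "xyz(n) = ..." lines printed by lsq_explicit define a rotation plus a translation. If you ever need to apply such an operator outside O, or want to convince yourself that it is a proper rigid-body operator, the following Python sketch will do. It uses the numbers printed above for the explicit fit of 1lap onto 1cro; the function names are invented for this illustration.

```python
# Each row holds the coefficients of x, y, z and the constant term,
# copied from the lsq_explicit output for 1lap above.
ROWS_1LAP = [
    (0.9571, 0.1367, -0.2555, -112.0573),
    (0.2552, 0.0197, 0.9667, -70.0792),
    (0.1371, -0.9904, -0.0160, 33.9509),
]


def apply_operator(rows, xyz):
    """Apply the printed operator to one coordinate triple."""
    x, y, z = xyz
    return tuple(a * x + b * y + c * z + t for a, b, c, t in rows)


def is_orthonormal(rows, tol=1e-3):
    """Check that the 3x3 part has unit-length, mutually perpendicular rows."""
    r = [row[:3] for row in rows]
    for i in range(3):
        for j in range(3):
            dot = sum(p * q for p, q in zip(r[i], r[j]))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    return True


print(is_orthonormal(ROWS_1LAP))
print(apply_operator(ROWS_1LAP, (0.0, 0.0, 0.0)))
```

The tolerance is needed because the printed coefficients are rounded to four decimals.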

11.3 analysis on the display

If we now check the displayed objects, we notice that the fit with calmodulin is quite reasonable (rms = 2.4 A for 25 atoms; helix E of the calcium-binding EF-hand has been matched with helix A3 of lambda cro repressor).
However, for leucine aminopeptidase the fit is not so good. In this case, only one helix overlaps with one of cro. This is an example where the lsq_improve option in O actually makes things worse (for our purposes, at least). If we re-do the lsq_explicit from the macro and redraw the chain, the visual fit is improved. The fit is still relatively poor, but the MOTIF is really there: a helix, a long loop and another helix with roughly the same orientation as that of the helices in cro. And this is of course the crux of DEJAVU: even though the sequence homology may be zero and the rms-fit of the Calpha-atoms may be high, you still get to see motifs which are "spatially similar" !!! So, the extremely simplistic description of SSEs (basically, through six coordinates) works to the advantage of the performance of the program !

Again, we used very strict criteria in this example and therefore we only got two hits. If you relax them a bit you get dozens of potential (DNA-binding ???) helix-whatever-helix motifs. If you do this and you plot all of the "hits" you typically get a nice clustering of red SSEs on your screen (the colour of the matched SSEs) from a collection of widely different proteins.

12 A REALISTIC EXAMPLE

Let's do some more serious work. We have reasons to believe that the B1-A1-B2 plus the B3-B4-A3 motifs of human class alpha glutathione S-transferase might constitute a glutathione-binding domain. Are there similar motifs in the database, preferably of proteins that bind glutathione ? Well, let's find out:

12.1 SSE file

First, we create and read our DEJAVU file for GSTA:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ)
   
 User DEJAVU file ? (user.secs)
gsta.secs
   
 REMARK >  === GSTA; sec structure according to ALWYN !!! NOT YASSPA !!!
 MOL    > gsta
 NOTE   > human class alpha glutathione s-transferase model m10a
 ENDMOL > gsta
 Nr of elements : (         14)
 ====== >   1 BETA   B1     A4     A7        4
 ====== >   2 ALPHA  A1     A16    A25      10
 ====== >   3 BETA   B2     A27    A35       9
 ====== >   4 ALPHA  A2     A37    A46      10
 ====== >   5 BETA   B3     A56    A58       3
 ====== >   6 BETA   B4     A62    A65       4
 ====== >   7 ALPHA  A3     A67    A78      12
 ====== >   8 ALPHA  A4     A85    A110     26
 ====== >   9 ALPHA  A5     A113   A141     29
 ====== >  10 ALPHA  A6     A154   A169     16
 ====== >  11 ALPHA  A7     A178   A189     12
 ====== >  12 ALPHA  A8     A191   A197      7
 ====== >  13 BETA   B5     A203   A205      3
 ====== >  14 ALPHA  A9     A209   A218     10
   
 Nr of lines read : (         21)
 Nr of elements   : (         14)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

12.2 search parameters

Then we enter the search parameters:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ********** NEW QUERY **********
   
 Elements : ( B1 A1 B2 A2 B3 B4 A3 A4 A5 A6 A7 A8 B5 A9)
 Nr of elements to match (0 = abort) ? (       0)
6
 Query element   1 ? ()
B1
 Query element   2 ? ()
A1
 Query element   3 ? ()
B2
 Query element   4 ? ()
B3
 Query element   5 ? ()
B4
 Query element   6 ? ()
A3
 ................... ( B1 A1 B2 B3 B4 A3)
 Mismatch nr of residues ? (          3)
4
 Mismatch element length ? (  10.000)
13
 Mismatch distances      ? (   5.000)
10
 Mismatch cosines        ? (   0.150)
0.4
   
 Possible distance criteria:
  C  => centre-to-centre
  H  => MIN head-tail and tail-head (anti-parallel)
  T  => MIN head-head and tail-tail (parallel)
 Which distances (C/H/T) ? (C)
c
 Extensive output        ? (N)
   
 Conserve directionality ? (Y)
   
 Conserve absolute motif ? (Y)
   
 Conserve neighbours     ? (Y)
n
 Create "O" macro file   ? (Y)
   
 "O" macro file          ? (lsq.omac)
gsta_lsq.omac
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

12.3 output

And then we watch the results (the "trivial hit", namely GSTA itself, has been omitted from the output):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of elements recognised in query : (       6)
 Indices : (       1        2        3        5        6        7)
 Nr of elements of each type : (       2        4)
   
 ********** 1gp1       **********
 [glutathione peroxidase (e.c.1.11.1.9) - bovine (bos $taurus           ]
 [/nfs/public/pdb/gp11.pdb                                              ]
 QUERY    : (       1        2        3        5        6        7)
 Elements :    B1       A1       B2       B3       B4       A3
 Lengths  : (   9.640   14.114   24.862    6.844    9.271   16.715)
 Residues : (       4       10        9        3        4       12)
   
 MATCH    : (       4        5        7       14       15       17)
 Elements :    B3       A2       B4       B9       B10      A7
 Lengths  : (  22.528   20.107   22.531   19.264   18.742   10.189)
 Residues : (       8       14        8        7        7        8)
 Length   ... rmsd =      9.074 ... match =      0.892
 Residues ... rmsd =      3.512 ... match =      0.922
 Distance ... rmsd =      2.407 ... match =      0.978
 Cosines  ... rmsd =      0.148 ... match =      0.985
 SCORE : (  16.672)
   
 MATCH    : (      20       21       23       29       30       32)
 Elements :    B13      A8       B14      B18      B19      A13
 Lengths  : (  22.630   19.887   22.532   16.943   10.320   10.139)
 Residues : (       8       14        8        6        4        8)
 Length   ... rmsd =      7.680 ... match =      0.906
 Residues ... rmsd =      3.109 ... match =      0.932
 Distance ... rmsd =      2.432 ... match =      0.980
 Cosines  ... rmsd =      0.155 ... match =      0.984
 SCORE : (  14.560)
 Nr of best match : (       2)
 Best score       : (  14.560)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
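
The "Length" and "Residues" rmsd values of the first match can be reproduced from the printed vectors as plain root-mean-square differences. The following Python sketch does this; it is an observation based on the numbers in this listing, not a statement of how DEJAVU computes its scores internally. The function names are hypothetical, and the use of "less than or equal" in the mismatch test is an assumption.

```python
import math

def rms_difference(query, match):
    """Root-mean-square difference between two equally long vectors."""
    if len(query) != len(match):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((q - m) ** 2 for q, m in zip(query, match)) / len(query))

def within_mismatch(query, match, cutoff):
    """True if every element-wise difference is within the cutoff (assumed <=)."""
    return all(abs(q - m) <= cutoff for q, m in zip(query, match))

# Values copied from the first 1gp1 match in the listing above.
query_lengths = [9.640, 14.114, 24.862, 6.844, 9.271, 16.715]
match_lengths = [22.528, 20.107, 22.531, 19.264, 18.742, 10.189]
query_residues = [4, 10, 9, 3, 4, 12]
match_residues = [8, 14, 8, 7, 7, 8]

length_rmsd = rms_difference(query_lengths, match_lengths)
residue_rmsd = rms_difference(query_residues, match_residues)
print(f"Length   rmsd = {length_rmsd:.3f}")
print(f"Residues rmsd = {residue_rmsd:.3f}")

# Cut-offs as entered in the search parameters (4 residues, 13 A).
print(within_mismatch(query_residues, match_residues, 4))
print(within_mismatch(query_lengths, match_lengths, 13.0))
```

Both printed rmsd values agree with the first MATCH block to three decimals, and no individual difference exceeds the mismatch values that were entered.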

And, voila, the only hit (other than GSTA itself) is glutathione peroxidase !!! In fact, there are two possible matches ! The O macro only contains instructions for the one with the lowest score, but we want to look at both. We therefore LIst this entry, so that we can edit the macro a bit and produce both matches on the screen:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (FI)
li
 Search on Name, Comment or Filename ? (N)
n
 Search string ? (p2)
1gp1
   
 MOL    > 1gp1
 NOTE   > glutathione peroxidase (e.c.1.11.1.9) - bovine (bos $taurus
 PDB    > /nfs/public/pdb/gp11.pdb
 Nr of elements : (         32)
 ====== >  Nr Type   Name   From   To     Nres
 ====== >   1 BETA   B1     A15    A17       3
 ====== >   2 BETA   B2     A25    A27       3
 ====== >   3 ALPHA  A1     A29    A31       3
 ====== >   4 BETA   B3     A35    A42       8
 ====== >   5 ALPHA  A2     A48    A61      14
 ====== >   6 ALPHA  A3     A63    A65       3
 ====== >   7 BETA   B4     A67    A74       8
 ====== >   8 ALPHA  A4     A85    A93       9
 ====== >   9 BETA   B5     A100   A102      3
 ====== >  10 BETA   B6     A106   A108      3
 ====== >  11 BETA   B7     A111   A113      3
 ====== >  12 ALPHA  A5     A120   A128      9
 ====== >  13 BETA   B8     A150   A152      3
 ====== >  14 BETA   B9     A160   A166      7
 ====== >  15 BETA   B10    A170   A176      7
 ====== >  16 ALPHA  A6     A181   A183      3
 ====== >  17 ALPHA  A7     A185   A192      8
 ====== >  18 BETA   B11    B15    B18       4
 ====== >  19 BETA   B12    B25    B27       3
 ====== >  20 BETA   B13    B35    B42       8
 ====== >  21 ALPHA  A8     B48    B61      14
 ====== >  22 ALPHA  A9     B63    B65       3
 ====== >  23 BETA   B14    B67    B74       8
 ====== >  24 ALPHA  A10    B85    B93       9
 ====== >  25 BETA   B15    B100   B104      5
 ====== >  26 BETA   B16    B106   B108      3
 ====== >  27 ALPHA  A11    B120   B128      9
 ====== >  28 BETA   B17    B150   B152      3
 ====== >  29 BETA   B18    B161   B166      6
 ====== >  30 BETA   B19    B173   B176      4
 ====== >  31 ALPHA  A12    B181   B183      3
 ====== >  32 ALPHA  A13    B185   B192      8
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Of course, the two matches occur with each of the two monomers in the dimer, but since the assignments of the SSEs are slightly different, we still produce both matches.

12.4 O macro

The resulting O macro (after editing it to include the second match) looks like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 194 gerard rigel 23:08:50 secs/database> cat gsta_lsq.omac
! "O" macro gsta_lsq.omac
! created by DEJAVU                 at Thu Oct 29 23:46:17 1992
!
print ... analysing gsta
print human class alpha glutathione s-transferase model m10a
print ... query  B1     A1     B2     B3     B4     A3
print ... allowed mismatches 4 13.000 10.000 0.400
print ... distance type C
print ... directionality Y
print ... absolute motif Y
print ... neighbours N
!
mol gsta obj xgsta
pai_zo gsta ; yellow
pai_zo gsta A4     A7     green
pai_zo gsta A16    A25    green
pai_zo gsta A27    A35    green
pai_zo gsta A56    A58    green
pai_zo gsta A62    A65    green
pai_zo gsta A67    A78    green
ca ; end
cent_id term_id gsta A4     CA ;
!
db_set_dat .lsq_integer 1 1 50
db_set_dat .lsq_integer 2 4 4
db_set_dat .lsq_integer 3 3 16999999
!
o_setup off off on
!
!
print ... comparing 1gp1
print glutathione peroxidase (e.c.1.11.1.9) - bovine (bos $taurus
print ... score = 14.55962
!
s_a_i /nfs/public/pdb/gp11.pdb 1gp1 pdb
!
lsq_expl gsta 1gp1
A4     A7     CA
B35
A16    A25    CA
B48
A27    A35    CA
B67
A56    A58    CA
B161
A62    A65    CA
B173
A67    A78    CA
B185
; 1gp1_to_gsta
!
lsq_impr 1gp1_to_gsta gsta ; 1gp1 ; CA 1gp1_to_gsta
!
lsq_mol 1gp1_to_gsta 1gp1 ;
!
mol 1gp1 obj c1gp1
pai_zo 1gp1 ; blue
pai_zo 1gp1 B35    B42    red
pai_zo 1gp1 B48    B61    red
pai_zo 1gp1 B67    B74    red
pai_zo 1gp1 B161   B166   red
pai_zo 1gp1 B173   B176   red
pai_zo 1gp1 B185   B192   red
ca ; end
!
!
s_a_i /nfs/public/pdb/gp11.pdb xgp1 pdb
!
lsq_expl gsta xgp1
A4     A7     CA
A35
A16    A25    CA
A48
A27    A35    CA
A67
A56    A58    CA
A160
A62    A65    CA
A170
A67    A78    CA
A185
; xgp1_to_gsta
!
lsq_impr xgp1_to_gsta gsta ; xgp1 ; CA xgp1_to_gsta
!
lsq_mol xgp1_to_gsta xgp1 ;
!
mol xgp1 obj cxgp1
pai_zo xgp1 ; blue
pai_zo xgp1 A35    A42    red
pai_zo xgp1 A48    A61    red
pai_zo xgp1 A67    A74    red
pai_zo xgp1 A160   A166   red
pai_zo xgp1 A170   A176   red
pai_zo xgp1 A185   A192   red
ca ; end
!
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
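
The hand-made part of such an edit is mostly the "lsq_expl" block for the second match: the same query ranges, each paired with the first residue of the matching SSE taken from the LIst output. The Python sketch below generates that text. The helper lsq_expl_block is hypothetical (it is not part of DEJAVU or O) and the column layout is simply copied from the macro shown above.

```python
def lsq_expl_block(query_mol, target_mol, pairs):
    """Build the text of an lsq_expl instruction in the layout of the macro above.

    pairs is a list of (query_first, query_last, target_first) residue names.
    """
    lines = [f"lsq_expl {query_mol} {target_mol}"]
    for query_first, query_last, target_first in pairs:
        # Query range and atom type on one line, start of the target range on the next.
        lines.append(f"{query_first:<6} {query_last:<6} CA")
        lines.append(target_first)
    # The last line names the operator under which the fit is stored.
    lines.append(f"; {target_mol}_to_{query_mol}")
    return "\n".join(lines)

# Query SSEs of gsta paired with the A-chain match of 1gp1 (read in as xgp1).
pairs = [
    ("A4", "A7", "A35"),
    ("A16", "A25", "A48"),
    ("A27", "A35", "A67"),
    ("A56", "A58", "A160"),
    ("A62", "A65", "A170"),
    ("A67", "A78", "A185"),
]
block = lsq_expl_block("gsta", "xgp1", pairs)
print(block)
```

The printed text can be pasted into the macro after the corresponding "s_a_i" line.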

12.5 running O

Executing this macro gives the following output (edited):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 196 gerard rigel 23:08:50 secs/database> 4d_ono general.o
  O > Use of this program implies acceptance of conditions
  O > described in Appendix 10 of the O manual
  O > O version 5.8, Sat Sep 26 13:59:06 MET 1992
[...]
@gsta_lsq.omac
  O > Macro in computer file-system.
 As4> ... analysing gsta
  O >  As4> human class alpha glutathione s-transferase model m10a
  O >  As4> ... query  B1     A1     B2     B3     B4     A3
  O >  As4> ... allowed mismatches 4 13.000 10.000 0.400
  O >  As4> ... distance type C
  O >  As4> ... directionality Y
  O >  As4> ... absolute motif Y
  O >  As4> ... neighbours N
  O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >   O >
  O >   O >   O >   O >   O >   O >   O >  As4> ... comparing 1gp1
  O >  As4> glutathione peroxidase (e.c.1.11.1.9) - bovine (bos $taurus
  O >  As4> ... score = 14.55962
[...]
 Lsq > The 30 atoms have an r.m.s. fit of 3.645
 Lsq >  xyz(1) =    -0.7311*x+    0.6446*y+    0.2236*z+   83.3897
 Lsq >  xyz(2) =     0.1075*x+   -0.2147*y+    0.9707*z+   -7.7601
 Lsq >  xyz(3) =     0.6737*x+    0.7338*y+    0.0877*z+  -33.9970
[...]
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    26 residues located.
 Lsq > A fragment of    14 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq >  Loop =   10 ,r.m.s. fit =     2.529 with    58 atoms
 Lsq >  x(1) =    -0.7038*x+    0.7023*y+    0.1070*z+   85.7188
 Lsq >  x(2) =     0.0950*x+   -0.0562*y+    0.9939*z+  -10.9052
 Lsq >  x(3) =     0.7040*x+    0.7097*y+   -0.0272*z+  -29.9750
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    24 residues located.
 Lsq > A fragment of    16 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq >  Loop =   11 ,r.m.s. fit =     3.361 with    58 atoms
 Lsq >  x(1) =    -0.6967*x+    0.7093*y+    0.1072*z+   85.3970
 Lsq >  x(2) =     0.0397*x+   -0.1111*y+    0.9930*z+   -8.9049
 Lsq >  x(3) =     0.7162*x+    0.6961*y+    0.0493*z+  -33.0698
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_ Lsq > Here are the
 fragments used in the alignment
 Lsq > 0    A4 PKLHYFNARGRMESTRWLLAAAGV    A27
 Lsq >     B36 LLIENVASL GTTVRDYTQMNDLQ    B59
 Lsq > 0   A28 EFEEKFIKS    A36
 Lsq >     B68 VVLGFPCNQ    B76
 Lsq > 0   A52 QQVPMVEID    A60
 Lsq >    B157 SWNFEKFLV   B165
 Lsq > 0   A61 GMKLVQTRAILNYIAS    A76
 Lsq >    B171 PVRRYSRRFLTIDIEP   B186
[...]
 Sam> Molecule XGP1 contained 555 residues and 3111 atoms
[...]
 Lsq > The 30 atoms have an r.m.s. fit of 4.841
 Lsq >  xyz(1) =    -0.1827*x+   -0.7881*y+   -0.5879*z+  157.7386
 Lsq >  xyz(2) =     0.8678*x+    0.1518*y+   -0.4732*z+   15.9964
 Lsq >  xyz(3) =     0.4621*x+   -0.5966*y+    0.6561*z+   -2.8169
 Lsq > The transformation can be stored in O.
[...]
 Lsq > 0Search for connected fragments.
 Lsq > A fragment of    24 residues located.
 Lsq > A fragment of    14 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq > A fragment of     9 residues located.
 Lsq > A fragment of     5 residues located.
 Lsq >  Loop =    9 ,r.m.s. fit =     3.248 with    61 atoms
 Lsq >  x(1) =    -0.1430*x+   -0.6702*y+   -0.7282*z+  154.9774
 Lsq >  x(2) =     0.9470*x+    0.1212*y+   -0.2975*z+    9.6677
 Lsq >  x(3) =     0.2877*x+   -0.7322*y+    0.6174*z+    9.8883
 Lsq > The transformation can be stored in O.
 Lsq > A blank is taken to mean do not store anything
 Lsq > The transformation will be stored in .LSQ_RT_ Lsq > Here are the
 fragments used in the alignment
 Lsq > 0    A4 PKLHYFNARGRMESTRWLLAAAGV    A27
 Lsq >     A36 LLIENVASL GTTVRDYTQMNDLQ    A59
 Lsq > 0   A28 EFEEKFIKS    A36
 Lsq >     A68 VVLGFPCNQ    A76
 Lsq > 0   A45 NDGYL    A49
 Lsq >    A153 RNDVS   A157
 Lsq > 0   A52 QQVPMVEID    A60
 Lsq >    A157 SWNFEKFLV   A165
 Lsq > 0   A61 GMKLVQTRAILNYI    A74
 Lsq >    A172 VRRYSRRFLTIDIE   A185
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Again, the sequence similarity is negligible, the rms-value of the fit is not too impressive, but if you look on the screen you see a very reasonable fit (except for the last helix) !!!
One also notes that the two monomers overlap exactly, which implies that the differences in SSE-assignments must be due to round-off errors in YASSPA.
By the way, the "o_setup" instruction in the macro ensures that you get a log file from O; this will be called o_log.lst. Print it and stick it right into your laboratory notebook !!!
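
The operators printed by Lsq can be checked or reused outside O. The Python sketch below evaluates the first printed operator literally (each output coordinate is a linear combination of x, y and z plus a constant) and verifies that its 3x3 part is a proper rotation to within the four printed decimals. The function names are hypothetical; judging by the operator name in the macro, this operator is the one applied to 1gp1 to bring it onto gsta.

```python
# Operator printed by Lsq for the first 30-atom fit (copied from the log above).
ROTATION = [
    [-0.7311, 0.6446, 0.2236],
    [0.1075, -0.2147, 0.9707],
    [0.6737, 0.7338, 0.0877],
]
TRANSLATION = [83.3897, -7.7601, -33.9970]

def apply_operator(rotation, translation, xyz):
    """Evaluate xyz(i) = sum_j rotation[i][j] * xyz[j] + translation[i]."""
    return tuple(
        sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def orthonormality_error(rotation):
    """Largest deviation of R * R^T from the identity matrix."""
    worst = 0.0
    for i in range(3):
        for j in range(3):
            dot = sum(rotation[i][k] * rotation[j][k] for k in range(3))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst

def determinant(m):
    """Determinant of a 3x3 matrix; +1 for a proper rotation."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )

# The origin is mapped onto the translation vector.
print(apply_operator(ROTATION, TRANSLATION, (0.0, 0.0, 0.0)))
print(f"orthonormality error = {orthonormality_error(ROTATION):.5f}")
print(f"determinant          = {determinant(ROTATION):.4f}")
```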

13 AUTOMATIC CREATION OF INPUT FILES

If you are too lazy to make your own DEJAVU input files, you can do it partially or even completely automatically. To this end there are two companion pre-processing programs, PRO1 and PRO2, as well as a csh-script called "makedb".

13.1 PRO1

This program requires a simple ASCII input file. It produces a macro that runs YASSPA in O and creates several intermediate files, as well as a csh-script for deleting all intermediate files afterwards. Running this program gives the following output:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 113 gerard sirius 20:03:14 secs/database> cat pro1.log
   
 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 ***
   
 Version  - 921027/0.03
 By       - Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
   
 Started  - Fri Oct 30 20:01:03 1992
 User     -
 Mode     - batch
 Not using a tty as input device
   
 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 ***
   
 PRO1 input file ? (pro1.inp)
u.inp
   
 "O" script file ? (pro1.omac)
u.omac
   
 csh script file ? (pro1.csh)
u.csh
   
 ... processing 1UBQ
 ... processing 1UTG
 ... processing 2UTG
   
 Nr of lines read         : (       6)
 Nr of proteins read      : (       3)
 Nr of proteins processed : (       3)
   
 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 ***
   
 Version - 921027/0.03
 Started - Fri Oct 30 20:01:03 1992
 Stopped - Fri Oct 30 20:01:03 1992
   
 CPU-time taken :
 User    -      0.0 Sys    -      0.1 Total   -      0.1
   
 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 *** PRO1 ***
   
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The input file must look something like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! u.inp
! Input file for PRO1/PRO2 created by script makedb
! File created at Fri Oct 30 20:01:01 MET 1992
'1UBQ' '/nfs/public/pdb/ubq1.pdb' 'UBIQUITIN - HUMAN (HOMO $SAPIEN'
'1UTG' '/nfs/public/pdb/utg1.pdb' 'UTEROGLOBIN (OXIDIZED) - RABBIT (ORYCTOLAGUS'
'2UTG' '/nfs/public/pdb/utg2.pdb' 'UTEROGLOBIN - UTERINE SECRETIONS '
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Every line beginning with an exclamation mark in column 1 is ignored.
The others must have the following items on one line:
- protein identifier
- PDB-file name (absolute pathnames, please !)
- comment or notes
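
The format described above is easy to check before running PRO1. The following Python sketch reads such a file, skips the comment lines, and insists on three quoted items with an absolute PDB path. It only illustrates the format; it is not how PRO1 itself parses its input. The function name is hypothetical, and skipping blank lines is an assumption.

```python
import re

QUOTED = re.compile(r"'([^']*)'")

def parse_pro_input(text):
    """Return a list of (identifier, pdb_file, note) tuples from a PRO1/PRO2 input file."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Lines with an exclamation mark in column 1 are comments.
        if line.startswith("!") or not line.strip():
            continue
        fields = QUOTED.findall(line)
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 quoted items, found {len(fields)}")
        identifier, pdb_file, note = fields
        if not pdb_file.startswith("/"):
            raise ValueError(f"line {number}: PDB file name is not an absolute pathname")
        entries.append((identifier, pdb_file, note))
    return entries

example = """! u.inp
! Input file for PRO1/PRO2 created by script makedb
'1UBQ' '/nfs/public/pdb/ubq1.pdb' 'UBIQUITIN - HUMAN (HOMO $SAPIEN'
'1UTG' '/nfs/public/pdb/utg1.pdb' 'UTEROGLOBIN (OXIDIZED) - RABBIT (ORYCTOLAGUS'
"""
for entry in parse_pro_input(example):
    print(entry)
```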

The csh-script will look like this:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
# 1UBQ
/bin/rm 1UBQ.name
/bin/rm 1UBQ.struc
/bin/rm 1UBQ.ca
# 1UTG
/bin/rm 1UTG.name
/bin/rm 1UTG.struc
/bin/rm 1UTG.ca
# 2UTG
/bin/rm 2UTG.name
/bin/rm 2UTG.struc
/bin/rm 2UTG.ca
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The O macro may look as follows:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
!
! 1UBQ
s_a_i /nfs/public/pdb/ubq1.pdb 1UBQ
mol 1UBQ
yasspa 1UBQ alpha 0.5
yasspa 1UBQ beta 0.8
wr 1UBQ_residue_name 1UBQ.name (1x,5a6)
wr 1UBQ_residue_2ry_struc 1UBQ.struc (1x,5a6)
sel_on 1UBQ ;
sel_prop atom_name ^= ca off
s_a_o 1UBQ.ca pdb 1UBQ ;; yes ;;
db_kill 1UBQ*
!
! 1UTG
s_a_i /nfs/public/pdb/utg1.pdb 1UTG
mol 1UTG
yasspa 1UTG alpha 0.5
yasspa 1UTG beta 0.8
wr 1UTG_residue_name 1UTG.name (1x,5a6)
wr 1UTG_residue_2ry_struc 1UTG.struc (1x,5a6)
sel_on 1UTG ;
sel_prop atom_name ^= ca off
s_a_o 1UTG.ca pdb 1UTG ;; yes ;;
db_kill 1UTG*
!
! 2UTG
s_a_i /nfs/public/pdb/utg2.pdb 2UTG
mol 2UTG
yasspa 2UTG alpha 0.5
yasspa 2UTG beta 0.8
wr 2UTG_residue_name 2UTG.name (1x,5a6)
wr 2UTG_residue_2ry_struc 2UTG.struc (1x,5a6)
sel_on 2UTG ;
sel_prop atom_name ^= ca off
s_a_o 2UTG.ca pdb 2UTG ;; yes ;;
db_kill 2UTG*
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

After running O with this macro, we have the following files:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
   2 -rw-r--r--   1 gerard       918 Oct 30 19:58 1UBQ.struc
   2 -rw-r--r--   1 gerard       918 Oct 30 19:58 1UBQ.name
  10 -rw-r--r--   1 gerard      5092 Oct 30 19:58 1UBQ.ca
   3 -rw-r--r--   1 gerard      1040 Oct 30 19:59 1UTG.name
   3 -rw-r--r--   1 gerard      1040 Oct 30 19:59 1UTG.struc
  10 -rw-r--r--   1 gerard      4690 Oct 30 19:59 1UTG.ca
   4 -rw-r--r--   1 gerard      2012 Oct 30 19:59 2UTG.name
   4 -rw-r--r--   1 gerard      2012 Oct 30 19:59 2UTG.struc
  19 -rw-r--r--   1 gerard      9380 Oct 30 19:59 2UTG.ca
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The "XXXX.struc" files contain the YASSPA datablocks, the "XXXX.name" files the residue identifiers and the "XXXX.ca" files are PDB coordinate files for only the Calpha atoms.

13.2 PRO2

Now you are ready to run PRO2. It uses the same input file that PRO1 read earlier, as well as all the files created by O. The output looks like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 118 gerard sirius 20:03:14 secs/database> cat pro2.log
   
 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 ***
   
 Version  - 921026/0.02
 By       - Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
   
 Started  - Fri Oct 30 20:02:18 1992
 User     -
 Mode     - batch
 Not using a tty as input device
   
 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 ***
   
 Max nr of residues per protein : (       5000)
 Max nr of sec structure types  : (         10)
   
 PRO2 input file ? (pro1.inp)
u.inp
   
 DEJAVU database file ? (secs.newlib)
u.secs
   
 How many sec struc types ? (       2)
   
 Enter data for sec structure type : (       1)
 Name (6 characters) ? (ALPHA)
   
 Abbreviation (1 ch) ? (A)
   
 Enter data for sec structure type : (       2)
 Name (6 characters) ? (BETA)
   
 Abbreviation (1 ch) ? (B)
   
 Names : ( ALPHA BETA)
 Abbreviations : ( A B)
   
 ... processing 1UBQ
 ... UBIQUITIN - HUMAN (HOMO $SAPIEN
 ... /nfs/public/pdb/ubq1.pdb
 Nr of residue names   : (     134)
 Nr of YASSPA residues : (     134)
 Nr of res + sec struc : (      50)
 Nr of CA coordinates  : (      76)
 Nr of am ac + sec str : (      50)
 Nr of sec struc elems : (       9)
 Types : (       3        6)
   
 ... processing 1UTG
 ... UTEROGLOBIN (OXIDIZED) - RABBIT (ORYCTOLAGUS
 ... /nfs/public/pdb/utg1.pdb
 Nr of residue names   : (     153)
 Nr of YASSPA residues : (     153)
 Nr of res + sec struc : (      56)
 Nr of CA coordinates  : (      70)
 Nr of am ac + sec str : (      56)
 Nr of sec struc elems : (       5)
 Types : (       5        0)
   
 ... processing 2UTG
 ... UTEROGLOBIN - UTERINE SECRETIONS
 ... /nfs/public/pdb/utg2.pdb
 Nr of residue names   : (     305)
 Nr of YASSPA residues : (     305)
 Nr of res + sec struc : (     109)
 Nr of CA coordinates  : (     140)
 Nr of am ac + sec str : (     109)
 Nr of sec struc elems : (       9)
 Types : (       9        0)
   
 Nr of lines read         : (       6)
 Nr of proteins read      : (       3)
 Nr of proteins processed : (       3)
   
 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 ***
   
 Version - 921026/0.02
 Started - Fri Oct 30 20:02:18 1992
 Stopped - Fri Oct 30 20:02:21 1992
   
 CPU-time taken :
 User    -      0.6 Sys    -      0.2 Total   -      0.8
   
 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 *** PRO2 ***
   
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

13.3 SSE file

The result of this is a file which can be read by DEJAVU:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
!
! ===  1UBQ
!
MOL    1UBQ
NOTE   UBIQUITIN - HUMAN (HOMO $SAPIEN
PDB    /nfs/public/pdb/ubq1.pdb
!
BETA  'B1' '2' '7' 6 26.849 29.021 3.898 30.224 38.643 16.662
BETA  'B2' '11' '16' 6 31.190 42.012 12.331 31.219 27.341 4.275
ALPHA 'A1' '23' '33' 11 31.287 22.201 16.417 39.807 32.994 9.233
ALPHA 'A2' '38' '40' 3 38.816 28.019 19.889 37.737 31.636 23.712
BETA  'B3' '41' '45' 5 34.737 30.874 21.473 22.125 29.062 18.183
BETA  'B4' '49' '51' 3 25.348 26.871 23.643 29.014 21.656 22.288
ALPHA 'A3' '57' '59' 3 22.923 18.583 12.025 21.078 21.149 16.251
BETA  'B5' '60' '62' 3 19.064 21.352 12.999 20.080 24.773 8.033
BETA  'B6' '65' '74' 10 21.418 30.253 9.620 40.871 33.801 30.253
ENDMOL
!
! ===  1UTG
!
MOL    1UTG
NOTE   UTEROGLOBIN (OXIDIZED) - RABBIT (ORYCTOLAGUS
PDB    /nfs/public/pdb/utg1.pdb
!
ALPHA 'A1' '4' '14' 11 30.857 26.132 29.178 27.175 11.022 27.802
ALPHA 'A2' '18' '28' 11 36.402 7.816 28.131 39.520 21.056 36.843
ALPHA 'A3' '32' '46' 15 43.004 10.865 42.182 28.542 3.621 28.934
ALPHA 'A4' '50' '65' 16 23.195 5.333 22.534 17.765 26.475 28.934
ALPHA 'A5' '67' '69' 3 13.143 29.843 32.238 17.682 28.430 34.983
ENDMOL
!
! ===  2UTG
!
MOL    2UTG
NOTE   UTEROGLOBIN - UTERINE SECRETIONS
PDB    /nfs/public/pdb/utg2.pdb
!
ALPHA 'A1' 'A4' 'A14' 11 27.389 27.997 -3.826 33.154 25.609 10.405
ALPHA 'A2' 'A18' 'A28' 11 26.224 30.108 15.661 16.176 26.975 3.151
ALPHA 'A3' 'A32' 'A46' 15 12.301 22.187 13.161 32.443 24.577 18.017
ALPHA 'A4' 'A50' 'A65' 16 40.262 26.962 15.045 38.390 21.010 -6.594
ALPHA 'A5' 'B4' 'B14' 11 29.114 11.709 -6.099 27.611 11.044 9.247
ALPHA 'A6' 'B18' 'B28' 11 35.706 5.945 11.488 41.514 12.506 -2.421
ALPHA 'A7' 'B32' 'B46' 15 48.793 14.249 7.397 29.683 10.438 16.285
ALPHA 'A8' 'B50' 'B65' 16 22.172 8.174 15.475 18.958 18.636 -4.118
ALPHA 'A9' 'B67' 'B69' 3 15.847 24.030 -6.553 21.323 23.866 -6.168
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
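
Each SSE line in this file can be taken apart with a few lines of code. The Python sketch below splits one record into its type, name, first and last residue, number of residues and the two sets of coordinates, and computes the distance between the two end points. The function names are hypothetical, and the sketch assumes that quoted names contain no blanks. That this distance corresponds to the "Lengths" printed in the search output is an assumption.

```python
import math

def parse_sse_record(line):
    """Split one SSE line into a dictionary of its eleven items."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 items, found {len(parts)}")
    name, first, last = (item.strip("'") for item in parts[1:4])
    coords = [float(value) for value in parts[5:]]
    return {
        "type": parts[0],
        "name": name,
        "first": first,
        "last": last,
        "nres": int(parts[4]),
        "start_xyz": tuple(coords[:3]),
        "end_xyz": tuple(coords[3:]),
    }

def sse_length(record):
    """Distance between the two coordinate sets of an SSE record."""
    return math.dist(record["start_xyz"], record["end_xyz"])

# First SSE of 1UBQ, copied from the file above.
record = parse_sse_record(
    "BETA  'B1' '2' '7' 6 26.849 29.021 3.898 30.224 38.643 16.662"
)
print(record["type"], record["name"], record["first"], record["last"], record["nres"])
print(f"{sse_length(record):.3f}")
```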

PRO2 is clever enough not to be bothered by the strange way in which O creates and writes residue identifiers; it knows when it is dealing with DNA or polysaccharide molecules, and it won't generate structural elements which comprise residues from two different chains. The only remaining problems are with integer chain id's in the PDB file and with multiple NMR structures in one PDB file.

PRO2 generates SSEs by simply looking for continuous stretches of ALPHA or BETA, retrieving the corresponding residue id's and the coordinates of the Calpha atoms of the first and last residues.
When PRO2 has finished, you may execute the csh-script to get rid of intermediate files.
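
The idea of collecting continuous stretches can be sketched in a few lines of Python. This is only an illustration of the principle described above, not PRO2's actual code. The input layout (a list of chain, residue id and label per residue), the function name and the absence of a minimum element length are all assumptions.

```python
from itertools import groupby

def find_sses(residues, abbreviations=None):
    """Group consecutive residues with the same label into SSEs.

    residues is a list of (chain, residue_id, label) tuples in file order.
    Returns (type, name, first_residue, last_residue, nres) tuples.
    """
    if abbreviations is None:
        abbreviations = {"ALPHA": "A", "BETA": "B"}
    counters = {label: 0 for label in abbreviations}
    sses = []
    # Grouping on (chain, label) ends a stretch at every chain boundary,
    # so no element contains residues from two different chains.
    for (chain, label), run in groupby(residues, key=lambda r: (r[0], r[2])):
        if label not in abbreviations:
            continue
        run = list(run)
        counters[label] += 1
        name = f"{abbreviations[label]}{counters[label]}"
        sses.append((label, name, run[0][1], run[-1][1], len(run)))
    return sses

# Hypothetical input: a helix and a strand in chain A, and a strand that
# starts chain B directly after the strand that ends chain A.
residues = [
    ("A", "A10", "ALPHA"), ("A", "A11", "ALPHA"), ("A", "A12", "ALPHA"),
    ("A", "A13", ""),
    ("A", "A14", "BETA"), ("A", "A15", "BETA"),
    ("B", "B1", "BETA"), ("B", "B2", "BETA"),
]
for sse in find_sses(residues):
    print(sse)
```

For each element found this way, the coordinates of the Calpha atoms of the first and last residue would then be looked up in the corresponding "XXXX.ca" file.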

13.4 makedb

If you want to automate the process completely, or if you want to create your own database of SSEs, then you may use the csh-script called "makedb".
This script processes one or several PDB files, creates the necessary input files for PRO1, O and PRO2, runs these programs and deletes all intermediate files.
To use this script, copy it to your own directory, edit it appropriately and type: source makedb. The output looks as follows (unfortunately, O output cannot be redirected; if you try to do it anyway, the program gets into an endless "empty input line"-loop):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 177 gerard sirius 19:57:47 secs/database> source makedb
... scanning PDB files ...
... /nfs/public/pdb/ubq1.pdb ...
... /nfs/public/pdb/utg1.pdb ...
... /nfs/public/pdb/utg2.pdb ...
... running PRO1 ...
... running O ...
  O > Use of this program implies acceptance of conditions
  O > described in Appendix 10 of the O manual
  O > O version 5.8, Sat Sep 26 13:59:06 MET 1992
  O > Loading /home/gerard/progs/secs/database/general.o
  O > Maximum inter-residue link distance = 6.00
  O >  There were   23 residues.
  O >              113 atoms.
  O > Do you want to use the display? [Yes]:   O >  Error in INST, object DISP_BONDS
  O >   O > Macro in computer file-system.
 Sam> File type is PDB
 Sam>  Nothing marked for deletion, so no compression.
 Sam> Molecule 1UBQ contained 134 residues and 660 atoms
  O > +
  O >  Current molecule 1ZNA   has not been loaded.
  O >  Util> Template size :    5 residues.
 Util>  There were      17
 Util>  Prompt:
  O >  Util> Template size :    5 residues.
 Util>  There were      33
 Util>  Prompt:
  O >   O >   O >   O >   O >  Sam> I can't recognise file type from the file name
 Sam> What IS the file type? [PDB]:  Sam>         76 atoms written out.
  O >  Heap>  Deleted 1UBQ_ATOM_XYZ
 Heap>  Deleted 1UBQ_ATOM_B
 Heap>  Deleted 1UBQ_ATOM_WT
 Heap>  Deleted 1UBQ_ATOM_Z
 Heap>  Deleted 1UBQ_ATOM_NAME
 Heap>  Deleted 1UBQ_ATOM_VISIBLE
 Heap>  Deleted 1UBQ_ATOM_SELECT
 Heap>  Deleted 1UBQ_RESIDUE_NAME
 Heap>  Deleted 1UBQ_RESIDUE_TYPE
 Heap>  Deleted 1UBQ_RESIDUE_POINTERS
 Heap>  Deleted 1UBQ_RESIDUE_CG
 Heap>  Deleted 1UBQ_PDB_HEADER
 Heap>  Deleted 1UBQ_PDB_COMPND
 Heap>  Deleted 1UBQ_PDB_SOURCE
 Heap>  Deleted 1UBQ_PDB_CRYST1
 Heap>  Deleted 1UBQ_PDB_SCALE
 Heap>  Deleted 1UBQ_MOLECULE_TYPE
 Heap>  Deleted 1UBQ_MOLECULE_CA
 Heap>  Deleted 1UBQ_MOLECULE_CA_MXDST
 Heap>  Deleted 1UBQ_RESIDUE_2RY_STRUC
  O >   O >   O >  Sam> File type is PDB
[...]
 Heap>  Deleted 2UTG_MOLECULE_CA_MXDST
 Heap>  Deleted 2UTG_RESIDUE_2RY_STRUC
  O >   O >  As1>  Saved
44.5u 3.2s 1:13 65%
... running PRO2 ...
... removing intermediate files ...
... started at Fri Oct 30 20:01:00 MET 1992 ...
... stopped at Fri Oct 30 20:02:22 MET 1992 ...
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The contents of the makedb script:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
# makedb - csh script to generate an input file or (partial)
# database for DEJAVU
#
# edit this script, then do : source makedb
#
# Gerard Kleywegt @ 921023,24,26,27,30
#
# remove the FOLLOWING lines after copying this to your own directory
echo
echo "makedb - sorry, you have to copy me to your own directory"
echo " first, then edit me and THEN you may execute me"
echo
exit -1
# remove the ABOVE lines after copying this to your own directory

# uncomment the following line to get all commands echoed to the screen
# set echo

# edit the FOLLOWING lines before executing the script ***********************

# an identifier (will be used to generate filenames)
set id=u

# which PDB files are to be processed ? (you may use wildcards)
set sour=/nfs/public/pdb/u*.pdb
# set sour=/nfs/public/pdb/utg2.pdb
# set sour="/nfs/public/pdb/u*.pdb /nfs/public/pdb/x*.pdb"

# where are the executables of PRO1 and PRO2 ?
set prog=/nfs/public/IRIX/bin

# the directory and name of the O executable for your machine
set oexe=/nfs/taj/alwyn/o/bin/4d_ono

# any O database file of your own
set ofil=/home/gerard/progs/secs/database/general.o

# the scratch directory where all intermediate files are kept
set scrat=/nfs/scratch/gerard

# edit the ABOVE lines before executing the script ***********************

# derive other file names automagically
set prof=$id.inp
set omac=$id.omac
set scsh=$id.csh
set secs=$id.secs

# set some variables and redefine 'grep'
set savedir=$cwd
set started=`date`
alias grep 'grep -i'

# go to the work directory
cd $scrat

# *** make input file for PRO1/PRO2 ***
echo ... scanning PDB files ...

# write message to output file
echo ! $prof > $prof
echo ! Input file for PRO1/PRO2 created by script "makedb" >> $prof
echo ! File created at `date` >> $prof

# loop over the PDB files
foreach file ($sour)

  # show the user that you're actually doing something
  echo ... $file ...

  # grab the molecule name from the HEADER record
  set molnam="`head -10 $file | grep 'header ' | cut -c63-66`"

  # grab the compound name from the FIRST COMPND record
  set compnd="`head -10 $file | grep 'compnd ' | cut -c11-59`"

  # get the source from the SOURCE record
  set source="`head -10 $file | grep 'source ' | cut -c11-29`"

  # add the appropriate line to the PRO1/PRO2 input file
  echo "'""$molnam""' '"$file"' '""$compnd - $source""'" >> $prof

end

# *** run PRO1 ***

# create input file
echo $prof > temp1
echo $omac >> temp1
echo $scsh >> temp1

echo ... running PRO1 ...
$prog/4d_pro1 -batch < temp1 >& pro1.log

# check if there were errors
grep error pro1.log

# make the csh-script executable
chmod +x $scsh

# *** run O ***

# create input file
echo no > tempo
echo "@$omac" >> tempo
echo stop >> tempo

echo ... running O ...
$oexe $ofil < tempo

# *** run PRO2 ***

# create input file
echo $prof > temp2
echo $secs >> temp2
echo 2 >> temp2
echo ALPHA >> temp2
echo A >> temp2
echo BETA >> temp2
echo B >> temp2

echo ... running PRO2 ...
$prog/4d_pro2 -batch < temp2 >& pro2.log

# check if there were errors
grep error pro2.log

# *** clean up ***
echo ... removing intermediate files ...
$scsh
\rm temp1 tempo temp2 pro1.log pro2.log $scsh $prof $omac

# go back to original directory
cd $savedir

echo ... started at $started ...
echo ... stopped at `date` ...

# unset echo

exit 0
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Note that the script assumes that there is a HEADER, at least one COMPND and at least one SOURCE card among the first ten cards in each PDB file.
If this is NOT the case, you must edit the input file that is created ($prof) and you may want to temporarily remove the statements at the end that remove all intermediate files.
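
To see beforehand whether a PDB file satisfies this assumption, the column extraction done by the script can be mimicked. The Python sketch below uses the same column ranges as the "cut" commands in makedb and returns nothing when one of the three cards is missing among the first ten. It is not a replacement for the script: the function names are hypothetical, the sketch matches record names at the start of a card (the script greps anywhere in the line), and it strips surrounding blanks.

```python
def cut(line, first, last):
    """Return columns first..last (1-based, inclusive), like `cut -cFIRST-LAST`."""
    return line[first - 1:last]

# Record name -> (first column, last column), taken from the cut calls in makedb.
FIELDS = {"HEADER": (63, 66), "COMPND": (11, 59), "SOURCE": (11, 29)}

def scan_pdb_cards(cards):
    """Pick the first HEADER, COMPND and SOURCE card among the first ten cards."""
    found = {}
    for card in cards[:10]:
        record = card[:6].strip().upper()
        if record in FIELDS and record not in found:
            first, last = FIELDS[record]
            found[record] = cut(card, first, last).strip()
    return found

def pro_input_line(pdb_file, cards):
    """Build one PRO1/PRO2 input line, or return None if a card is missing."""
    found = scan_pdb_cards(cards)
    if set(found) != set(FIELDS):
        return None
    note = f"{found['COMPND']} - {found['SOURCE']}"
    return f"'{found['HEADER']}' '{pdb_file}' '{note}'"

def check_file(pdb_file):
    """Read a PDB file and report whether makedb's assumption holds for it."""
    with open(pdb_file) as handle:
        cards = [line.rstrip("\n") for line in handle]
    line = pro_input_line(pdb_file, cards)
    return line if line is not None else f"! {pdb_file}: edit the input file by hand"
```

Calling check_file on each PDB file before running makedb shows which entries will need manual editing.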

14 DETAILED ANALYSIS OF RESULTS ON CRO

We mentioned before that relaxing the criteria in the search for the DNA-binding helix-(turn)-helix motif of lambda cro repressor would yield many more hits than the two we obtained in the example.
If we actually do this, we may get the following hits:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 110 gerard rose 15:24:13 progs/secs> grep s_a_i cro_relax.omac
s_a_i /nfs/public/pdb/cro1.pdb 1cro
s_a_i /nfs/public/pdb/acn5.pdb 5acn pdb
s_a_i /nfs/public/pdb/acn6.pdb 6acn pdb
s_a_i /nfs/public/pdb/api7.pdb 7api pdb
s_a_i /nfs/public/pdb/api8.pdb 8api pdb
s_a_i /nfs/public/pdb/api9.pdb 9api pdb
s_a_i /nfs/public/pdb/cat7.pdb 7cat pdb
s_a_i /nfs/public/pdb/cat8.pdb 8cat pdb
s_a_i /nfs/public/pdb/ccp1.pdb 1ccp pdb
s_a_i /nfs/public/pdb/ccp2.pdb 2ccp pdb
s_a_i /nfs/public/pdb/ccp3.pdb 3ccp pdb
s_a_i /nfs/public/pdb/ccp4.pdb 4ccp pdb
s_a_i /nfs/public/pdb/cro1.pdb 1cro pdb
s_a_i /nfs/public/pdb/csc1.pdb 1csc pdb
s_a_i /nfs/public/pdb/csc2.pdb 2csc pdb
s_a_i /nfs/public/pdb/csc3.pdb 3csc pdb
s_a_i /nfs/public/pdb/csc4.pdb 4csc pdb
s_a_i /nfs/public/pdb/csc5.pdb 5csc pdb
s_a_i /nfs/public/pdb/cts1.pdb 1cts pdb
s_a_i /nfs/public/pdb/cts2.pdb 2cts pdb
s_a_i /nfs/public/pdb/cts3.pdb 3cts pdb
s_a_i /nfs/public/pdb/cts5.pdb 5cts pdb
s_a_i /nfs/public/pdb/cts6.pdb 6cts pdb
s_a_i /nfs/public/pdb/cyp2.pdb 2cyp pdb
s_a_i /nfs/public/pdb/cro3.pdb 3cro pdb
s_a_i /nfs/public/pdb/hco1.pdb 1hco pdb
s_a_i /nfs/public/pdb/icd3.pdb 3icd pdb
s_a_i /nfs/public/pdb/icd4.pdb 4icd pdb
s_a_i /nfs/public/pdb/icd5.pdb 5icd pdb
s_a_i /nfs/public/pdb/icd6.pdb 6icd pdb
s_a_i /nfs/public/pdb/icd7.pdb 7icd pdb
s_a_i /nfs/public/pdb/icd8.pdb 8icd pdb
s_a_i /nfs/public/pdb/icd9.pdb 9icd pdb
s_a_i /nfs/public/pdb/lap1.pdb 1lap pdb
s_a_i /nfs/public/pdb/lrd1.pdb 1lrd pdb
s_a_i /nfs/public/pdb/lzm2.pdb 2lzm pdb
s_a_i /nfs/public/pdb/lzm3.pdb 3lzm pdb
s_a_i /nfs/public/pdb/or12.pdb 2or1 pdb
s_a_i /nfs/public/pdb/phs1.pdb 1phs pdb
s_a_i /nfs/public/pdb/sic1.pdb 1sic pdb
s_a_i /nfs/public/pdb/trc1.pdb 1trc pdb
s_a_i /nfs/public/pdb/ts13.pdb 3ts1 pdb
s_a_i /nfs/public/pdb/ts14.pdb 4ts1 pdb
s_a_i /nfs/public/pdb/xia1.pdb 1xia pdb
s_a_i /nfs/public/pdb/xia4.pdb 4xia pdb
s_a_i /nfs/public/pdb/xia5.pdb 5xia pdb
s_a_i /nfs/public/pdb/xia6.pdb 6xia pdb
s_a_i /nfs/public/pdb/xia7.pdb 7xia pdb
s_a_i /nfs/public/pdb/xia8.pdb 8xia pdb
s_a_i /nfs/public/pdb/xia9.pdb 9xia pdb
s_a_i /nfs/public/pdb/55c1.pdb 155c pdb
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In fact, we used the following parameters:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 107 gerard sirius 15:24:58 secs/database> more cro_relax.omac
! "O" macro cro_relax.omac
! created by DEJAVU                 at Fri Oct 30 15:26:41 1992
!
o_setup off off on
!
print ... analysing 1cro
print cro repressor - bacteriophage (lamb
print ... query  A2     A3
print ... allowed mismatches 2 6.000 5.000 0.250
print ... distance type H
print ... directionality Y
print ... absolute motif Y
print ... neighbours Y
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

14.1 results

We have processed a representative selection of these hits with O (i.e., using only the best scoring protein of a set of related ones, such as the seven xia, d-xylose isomerase). The results are summarised in the following table.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 =========================================================================================
                                        15   20   25   30   35
                                        |    |    |    |    |
 1cro score  rmsX  NI  rmsI    O11  AMRFGQTKTAKDLGVYQSAINKAIHAGR   O38  lambda cro repressor
                                         XXXXXXXX   XXXXXXXXXX          (the two helices)
 =========================================================================================
 5acn  9.36  2.67  22  3.40    733      ETQIEWFRAGSALNRMKELQQK     754  aconitase
 8api  5.01  4.77  21  3.11   A264              ENELTHDIITKFLEN   A278  alpha-1-antitrypsin
 8cat  6.63  4.63  18  2.78   B252           LAHEDPDYGLRDLFNAIA   B269  catalase
 2ccp  6.09  5.45  14  1.79    240                QDPKYLSIVKEYAN   253  cytochrome-c peroxidase
 2cts  7.11  3.04  27  2.92     66  FRGFSIPECQKLLPK                 80  citrate synthase
                                87                 PLPEGLFWLLVT     98
 2cyp  6.39  5.47  32  2.93    202..NE                             209  cytochrome-c peroxidase
                               241                  DPKYLSIVKEY    251
                                91..KE                              98  (with cro A-chain)
                                15      SYEDF                       19  (with cro B-chain)
 3cro  9.83  5.53  31  2.90    R56..QYG                            R62  434 cro repressor
                               R40        KRPRFLF                  R46
                               L41                RPRFLFEIAMALNC.. L57
 1hco  6.41  4.84  17  3.09    B42   FESFGD                        B47  haemoglobin
                               B57                  NPKVKAHGKKV    B67
 5icd  6.85  5.33  29  3.20     85                 PAETLDLIREYR     96  isocitrate dehydrogenase
                               353..GSII                           357  (with cro C-chain)
                               386                 AKTVTY          391  (with cro C-chain)
 1lap  4.86  5.77  21  2.61    425              RSAGACTAAAFLKEF    439  leucine aminopeptidase
 1lrd  2.84  0.60  25  3.70 !  329    LGLSQESVADKMGMGQSGVGALFNG    353  lambda repressor
 3lzm  7.68  3.95  29  2.87     95..ALIN                           101  lysozyme
                               113                 GFTNSLRMLQQKR.. 127
 2or1  3.12  0.67  32  3.40 !   L5..RI                             L11  434 repressor
                               L13    LGLNQAELAQKVGTTQQSIEQLENG    L37
 1phs  3.69  2.09  52  3.06    340..RALDGKDVLGLTFSGSGDEVMKLINKQ    372  phaseolin
                                39..QQSK                            44  (with cro A-chain)
                                13..YFNSD                           19  (with cro B-chain)
 1sic  6.47  3.55  24  2.59   E229     GAAALILS                   E236  subtilisin
                              E238             HPNWTNTQVRSSLQNT   E253
 1trc  1.02  2.96  25  2.36    A99    YISAAELRHV                  A108  calmodulin
                              A114              EKLTDEEVDEMIREA   A128
 4ts1  6.28  4.72  24  3.24   A144     SVNYM                      A148  Tyr-tRNA synthetase
                              A152                    ESVQSRIETG..A165
                               B35               CGFDP             B39  (with cro C-chain)
 6xia  5.98  2.98  29  2.49    215      PEVGHEQMAGLNFPHGIAQALWA    237  d-xylose isomerase
 155c  6.00  4.15  18  2.86     73      ANLIEY                      78  cytochrome-c550
                                80                 TDPKPLVKKMTD     91
 =========================================================================================
                                         XXXXXXXX   XXXXXXXXXX          (the two helices)
 1cro score  rmsX  NI  rmsI    O11  AMRFGQTKTAKDLGVYQSAINKAIHAGR   O38  lambda cro repressor
                                        |    |    |    |    |
                                        15   20   25   30   35
 =========================================================================================
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Legend: the first column contains the PDB identifier, which is followed by the score according to DEJAVU (score), the rms fit of the Calpha atoms using the lsq_explicit option in O (rmsX), the number of matched residues as determined by the lsq_improve option in O (NI) and the rms fit of the Calpha atoms of these residues (rmsI). The right-hand part of the table shows (some of) the structural alignments found by lsq_improve insofar as they pertain to residues in and around the helix-turn-helix motif of 1cro.

NOTE: since lsq_improve does a global optimisation for the alignment of two proteins, the resulting picture is sometimes worse than after a simple lsq_explicit (e.g., for 1lrd and 2or1). Also, this option is sometimes unstable, alternating between two solutions and not always ending up with the best one.

NOTE: there doesn't seem to be a simple correlation between the DEJAVU scores and the rms-fit values, so be careful when throwing away hits with a high DEJAVU score (e.g., 5acn and 2cts) !

NOTE: how widely different amino-acid sequences may yield similar spatial motifs !!!

NOTE: the best hits are those for which both helices are part of a long matching sequence of residues (i.e., 5acn, 2cts, 1lrd, 2or1, 1phs, 1sic, 1trc, 6xia and 155c).

15 MISCELLANEOUS

15.1 HOW TO CREATE AND USE YOUR OWN DATABASE

- use the script makedb to create your DEJAVU database(s)
- copy the public file "secs.lib" (this is only a few lines) to your own directory
- add a "CHAIN" statement that points to your own database
- add a "CHAIN" statement to your own database which points back to the local database (e.g., "uppsala.secs"; this file in turn should be chained to the PDB-derived database, e.g., "pdb.secs")
- enter the file name of your private library when DEJAVU asks you for the name of the database file; all chained databases will then also be read
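
The chaining described above can be pictured with a small Python sketch (Python is used only for illustration; DEJAVU does this itself when it reads a library). It assumes that a CHAIN record is a line whose first word is CHAIN and whose second word is the name of the next file, as in the minimal library shown in the installation section. The function names, the demo file "mine.secs" and the guard against circular chains are hypothetical.

```python
import os
import tempfile


def chained_libraries(first_file):
    """Return the library files reached by following CHAIN records.

    Assumption: a CHAIN record is a line whose first word is CHAIN and
    whose second word is the name of the next library file.
    """
    order = []
    current = first_file
    while current and current not in order:  # stop on circular chains
        order.append(current)
        next_file = None
        with open(current) as handle:
            for line in handle:
                words = line.split()
                if len(words) >= 2 and words[0].upper() == "CHAIN":
                    next_file = words[1]  # the last CHAIN record wins
        current = next_file
    return order


def demo():
    """Build four tiny chained files and list them in reading order."""
    tmp = tempfile.mkdtemp()
    names = ["secs.lib", "mine.secs", "uppsala.secs", "pdb.secs"]
    paths = [os.path.join(tmp, name) for name in names]
    for i, path in enumerate(paths):
        with open(path, "w") as handle:
            handle.write("! dummy library\n")
            if i + 1 < len(paths):
                handle.write("CHAIN " + paths[i + 1] + "\n")
    return [os.path.basename(p) for p in chained_libraries(paths[0])]
```

Calling demo() lists the private library first and the PDB-derived one last, which is the order in which the chained files would be visited.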

15.2 HOW TO SELECT SEARCH PARAMETERS

- usually it's a good idea to start with rather strict parameters
- if a lot of hits come up, you can either repeat the search with even stricter parameters or check all hits in O
- if not many hits show up, relax your parameters a bit (mismatches of 5 residues, 17 A, 10 A and 0.4 in the cosines) and repeat
- if this doesn't help, relax the "binary" search criteria, first the conservation of neighbours, then that of absolute motif. Also try the three different distance measures (C, H and T). Only as a last resort should you release the directionality !
- if you still don't get any reasonable hits, you could try looking at a partial motif containing fewer SSEs (or you may conclude that you have a unique fold ...)
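
The order of relaxation suggested above can be written down as a short Python sketch. This is only a way of keeping track of which step you are at, not something DEJAVU does for you. The relaxed numbers are those quoted in the list; the starting point is a placeholder copied from the print lines of the cro example (assumed to be in the order residues, length, distance, cosine), and all names are hypothetical.

```python
# Relaxed numeric mismatches quoted in the list above.
RELAXED = {"residues": 5, "length": 17.0, "distance": 10.0, "cosine": 0.4}

# Placeholder starting point (values echoed in the cro example; the order
# of the four numbers is an assumption, and this is not a recommendation).
START = {
    "residues": 2, "length": 6.0, "distance": 5.0, "cosine": 0.25,
    "distance_type": "H", "directionality": True,
    "absolute_motif": True, "neighbours": True,
}


def relaxation_schedule(start):
    """Yield (label, settings) pairs in the order suggested in the text."""
    settings = dict(start)
    yield "start", dict(settings)

    settings.update(RELAXED)
    yield "relaxed numeric mismatches", dict(settings)

    settings["neighbours"] = False
    yield "neighbours no longer conserved", dict(settings)

    settings["absolute_motif"] = False
    yield "absolute motif no longer conserved", dict(settings)

    for kind in "CHT":  # the three distance measures named in the text
        if kind != start["distance_type"]:
            settings["distance_type"] = kind
            yield f"distance type {kind}", dict(settings)

    settings["directionality"] = False
    yield "last resort: directionality released", dict(settings)


if __name__ == "__main__":
    for label, settings in relaxation_schedule(START):
        print(label, settings)
```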

15.3 OTHER HINTS

- note that the various "print" statements in the lsq-macro for O mean that your O log file automatically serves as an electronic notebook ! A quick way to get an overview of your hits: csh-prompt> grep -i print lsq.omac
- if lsq_improve in O goes "haywire", e.g. matches only one helix perfectly but leaves the others sticking in the wrong directions, then re-do the lsq_explicit, lsq_mol, paint_zone and ca_zone instructions (e.g., by cutting and pasting on SGIs)
- always compare the folds of the interesting hits; sometimes, the spatial arrangements may be similar, whereas the folds are quite different !

15.4 PROBLEMS

- DEJAVU does not know that SSEs may be in different chains, so you may get hits consisting of a few SSEs from one monomer and a few from another monomer which together form a motif similar to yours
- YASSPA is not perfect and usually gives slightly different SSE-assignments than a protein scientist would make
- O residue names sometimes present a problem; they consist of a "$" (if the atom is a HETERO atom) + the chain id. + the residue number + the insert id. (the biggest trouble arises if people use numbers for the chain id. and/or the insert id.)
- if you encounter enormous problems, you may contact me. It is best to send me a mail which includes your protein's PDB and DEJAVU files and the names of the SSEs you wish to find. My E-mail address is: "gerard@xray.bmc.uu.se"
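
The residue-name problem mentioned above is easy to see in a few lines of Python. The sketch below simply concatenates the parts in the order given in the text; the function names are hypothetical and this is not code taken from O or DEJAVU.

```python
def o_residue_name(chain_id, residue_number, insert_id="", hetero=False):
    """Build a name as described: "$" (HETERO only) + chain + number + insert."""
    prefix = "$" if hetero else ""
    return f"{prefix}{chain_id}{residue_number}{insert_id}"


def is_ambiguous(chain_id, insert_id=""):
    """True if a numeric chain or insert id. blurs where the number starts or ends."""
    return chain_id.isdigit() or insert_id.isdigit()


if __name__ == "__main__":
    print(o_residue_name("A", 264))   # letter chain id.: easy to read back
    print(o_residue_name("1", 23))    # numeric chain id.: looks like residue 123
    print(is_ambiguous("1"))
```

With a numeric chain id. the name can no longer be split unambiguously into chain and residue number, which is the "biggest trouble" referred to above.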

16 SELECT OPTION

If you want to compare your structure with a subset of the PDB structures, you can use the select option:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ)
sele
   
 Options :
 (1) Select ALL entries
 (2) Select NONE of the entries
 (3) Select ON for one or more entries
 (4) Select OFF for one or more entries
 (5) Read a select macro file
 Option (1-5) ? (       1)
1
 Selected ALL entries
   
 Nr of selected entries now : (     607)
   
  2 CPU total/user/sys :       0.0       0.0       0.0
   
 ===> Option ? (SELE)
   
 Options :
 (1) Select ALL entries
 (2) Select NONE of the entries
 (3) Select ON for one or more entries
 (4) Select OFF for one or more entries
 (5) Read a select macro file
 Option (1-5) ? (       1)
5
 Select macro file ? (user.sel)
cici.select
   
 Selected NONE of the entries
 Select ON 1alc
 Select ON 2apr
 Select ON 5apr
 Select ON 1bp2
 Select ON 3bp2
 Select ON 4bp2
 ERROR --- Invalid entry code: 2c4s
 Select ON 1cdp
 Select ON 3cln
 Select ON 2cna
 Select ON 3cna
 Select ON 4cpv
 Select ON 5cpv
...
 Select ON 1trc
 Select ON 1trm
 Select ON 2trm
   
 Nr of selected entries now : (      87)
   
  2 CPU total/user/sys :       0.3       0.3       0.1
   
 ===> Option ? (SELE)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

A select file may contain comments (any line beginning with "!") and select records; possible types:
- select all
- select none
- select on pdb_code
- select off pdb_code

A select file may look as follows:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
< % 147 gerard sirius 23:09:47 secs/cbh1> cat cici.select
! Select file for DEJAVU
! Created by select.csh
! At Thu Feb 18 22:45:45 MET 1993
! Keywords calcium
!
Select none
Select on 1ALC
Select on 2APR
Select on 5APR
...
Select on 1TRC
Select on 1TRM
Select on 2TRM
!
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
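
To make the meaning of the select records concrete, here is a minimal Python sketch of how such a file could be interpreted. It is not DEJAVU's own code. It assumes that records are case-insensitive (the log above echoes "1ALC" as "1alc"), that all entries are selected before the file is read, and that unknown codes are reported and skipped, as with the "Invalid entry code" message in the log. The function name is hypothetical.

```python
def apply_select_file(lines, all_codes):
    """Return (selected codes, rejected records) for a list of select records."""
    known = {code.lower() for code in all_codes}
    selected = set(known)  # assumption: everything is selected to begin with
    rejected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment
        words = line.lower().split()
        if words[0] != "select" or len(words) < 2:
            rejected.append(raw)
        elif words[1] == "all":
            selected = set(known)
        elif words[1] == "none":
            selected = set()
        elif words[1] in ("on", "off") and len(words) == 3:
            code = words[2]
            if code not in known:
                rejected.append(raw)  # code not present in the database
            elif words[1] == "on":
                selected.add(code)
            else:
                selected.discard(code)
        else:
            rejected.append(raw)
    return selected, rejected


if __name__ == "__main__":
    records = ["! Select file for DEJAVU", "Select none",
               "Select on 1ALC", "Select on 2C4S", "Select on 1TRC"]
    print(apply_select_file(records, ["1alc", "1trc", "2apr"]))
```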

Use the following C-shell script (or an adaptation) to generate select files automatically by scanning for one or more keywords in all PDB files:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
#!/bin/csh -f
# select.csh - Gerard Kleywegt 1993
if ($#argv < 1) then
  echo
  echo "usage: $0 keyword1 [keyword2 ...]"
  echo
  exit 1
endif
#
set pdbdir=/nfs/public/pdb
#
set alfabet='a b c d e f g h i j k l m n o p q r s t u v w x y z'
set out=$argv[1].select
#
echo Looking for $argv[1-$#argv]
echo Select file $out
#
echo "! Select file for DEJAVU "  > $out
echo "! Created by $0"            >> $out
echo "! At `date`"                >> $out
echo "! Keywords $argv[1-$#argv]" >> $out
#
echo "! " >> $out
echo "Select none" >> $out
# loop over all letters in the alphabet
foreach letter ($alfabet)
  set files=`echo $pdbdir/$letter"*.pdb"`
  echo
  echo There are $#files PDB files beginning with the letter $letter
# loop over all files beginning with this letter
  foreach pdb ($files)
#   loop over all keywords
    foreach key ($argv)
#     count the nr of times this keyword occurs in the file
      set hits=`grep -c -i $key $pdb`
      if ($hits == 0) then
        goto failure
      endif
    end
#   if here, the file contains all keywords
    set molnam="`head -10 $pdb | grep -i 'header    ' | cut -c63-66`"
    set compnd="`head -10 $pdb | grep -i 'compnd    ' | cut -c11-59`"
    echo Protein $molnam in file $pdb
    echo Possible name "$compnd"
    echo "Select on $molnam" >> $out
#   in case of failure, you come here immediately
    failure:
  end
end
#
echo "! " >> $out
echo Done ...
exit 0
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

17 INCREMENTAL SEARCH EXAMPLE

The following is an example of an incremental search:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ)
in
   
 ********** NEW QUERY **********
   
 Elements : ( B1 B2 B3 B4 A1 B5 A2 A3 B6 B7 B8 B9 B10 B11 B12
 B13 A4 A5 B14 B15 B16 B17 B18 B19 A6 B20 B21 B22 B23 B24 A7
 A8 A9 B25 A10 A11 B26)
 Min nr of residues for SSEs             ? (       5)
6
 ................... ( B3 B4 A3 B8 B9 B11 B16 B17 B21 B22 A7
 A9 B25 A11 B26)
 Min nr of elements to match (0 = abort) ? (       4)
5
   
 Mismatch nr of residues ? (          3)
   
 Mismatch element length ? (  10.000)
   
 Mismatch distances      ? (   8.000)
   
 Mismatch cosines        ? (   0.400)
   
 Weights for scoring     ? (   0.250    0.250    0.250    0.250)
1 1 10 5
 Normalised weights      : (   0.059    0.059    0.588    0.294)
   
 Possible distance criteria:
  C  => centre-to-centre
  H  => MIN head-tail and tail-head (anti-parallel)
  T  => MIN head-head and tail-tail (parallel)
  I  => MIN of all these distances
  A  => MAX of all these distances
 Which distances (C/H/T/I/A) ? (C)
   
 Extensive output        ? (N)
   
 Conserve directionality ? (Y)
   
 Conserve absolute motif ? (Y)
   
 Conserve neighbours     ? (N)
   
 Attempt to avoid multi-chain hits ? (Y)
   
 Attempt to avoid identical proteins ? (Y)
   
 Create "O" macro file   ? (Y)
   
 "O" macro file          ? (lsq.omac)
   
 Nr of elements recognised in query : (      15)
 Indices : (       3        4        8       11       12
       14       21       22       27       28       31
       33       34       36       37)
 Nr of elements of each type : (       4       11)
   
 ********** 2cna       **********    108 **********
 [concanavalin a - jack bean (canavali                                  ]
 [/nfs/public/pdb/cna2.pdb                                              ]
 QUERY    : (       3        4        8       11       12
       14       21       22       27       28       31       33
       34       36       37)
 Elements :    B3       B4       A3       B8       B9
       B11      B16      B17      B21      B22
   
 A7       A9       B25      A11      B26
 Lengths  : (  26.477   31.328   10.053   22.441   24.508
   23.564   23.091   25.716   26.247   23.934   13.939   11.969
   19.554    9.769   27.656)
 Residues : (       9       11        7        9        9
        8        9        9        9        8       10        9
        7        7       10)
 Nr of common SSEs : (       5)
   
 MATCH    : (       0        7        0        9       10
       12        0        0       20        0        0        0
        0        0        0)
 Elements :    -X-      B6       -X-      B8       B9       B10
      -X-      -X-      B18      -X-     -X-      -X-      -X-
      -X-      -X-
 Lengths  : (  23.720   23.278   23.972   31.742   17.850)
 Residues : (       9        8        8       11        6)
 Length   ... rmsd =      6.265 ... match =      0.970
 Residues ... rmsd =      2.191 ... match =      0.973
 Distance ... rmsd =      4.260 ... match =      0.970
 Cosines  ... rmsd =      0.146 ... match =      0.981
 SCORE : (   3.163)
   
 Nr of hits        : (       1)
 Nr of common SSEs : (       5)
 Nr of best match  : (       1)
 Best score        : (   3.163)
   
 Nr of matching entries : (          1)
 Nr of hits (total)     : (          1)
   
 Entry    108 = 2cna = concanavalin a - jack bean (canavali
   
  2 CPU total/user/sys :       3.2       3.0       0.3
   
 ===> Option ? (IN)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
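
The "Normalised weights" line in the log above is consistent with simply dividing each of the four scoring weights by their sum. A minimal Python check (the function name is hypothetical, and this is an illustration rather than DEJAVU's own code):

```python
def normalise_weights(weights):
    """Scale the weights so that they add up to 1."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


if __name__ == "__main__":
    # The input "1 1 10 5" from the example above.
    print(["%.3f" % w for w in normalise_weights([1, 1, 10, 5])])
```

For the input 1 1 10 5 the sum is 17, which gives 0.059, 0.059, 0.588 and 0.294 when rounded to three decimals, as in the log.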

18 TOPOLOGY OPTION

This, rather crummy, option may help you in fathoming the topology of your protein. You enter a cosine and a distance cutoff which determine whether or not two SSEs are parallel (cosine >= cutoff) or anti-parallel (cosine <= -cutoff) and whether they are spatial neighbours (distance <= cutoff). A matrix is printed which contains +2 for parallel neighbours, +1 for parallel, -1 for anti-parallel and -2 for anti-parallel neighbours.

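In the output, each SSE name is followed by two numbers and then by its row of the matrix. The first number is the sum of the absolute values of the matrix entries for that SSE, not counting the diagonal element (if high, then the SSE is central in a motif); the second is the number of spatial neighbours. You should choose your cut-off such that no SSE has more than 2 spatial neighbours.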

DEJAVU produces a file which can be plotted (and converted into PostScript) with O2D (use "open 2 topo 0 1" to open a 2D window, then type "topo mytopo.file mytopo.ps" and voila).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 COSine   cut-off  ? (   0.800)
 DIStance cut-off  ? (   8.000)
 O2D topology file ? (cbh6a.topo)
 A1       5   1    11 -1  0  0  0  0  0  0  0  1 -1  2  0
 A2       6   0    -1 11  0 -1  0  0  0  0  0 -1  1 -1  1
 B1       3   1     0  0 11  0 -2  1  0  0  0  0  0  0  0
 B2       4   1     0 -1  0 11  0  0  0  0  0  0  0  1 -2
 B3       6   2     0  0 -2  0 11 -2  1 -1  0  0  0  0  0
 B4       6   2     0  0  1  0 -2 11 -2  1  0  0  0  0  0
 B5       5   2     0  0  0  0  1 -2 11 -2  0  0  0  0  0
 B6       4   1     0  0  0  0 -1  1 -2 11  0  0  0  0  0
 B7       2   1     0  0  0  0  0  0  0  0 11 -2  0  0  0
 B8       7   2     1 -1  0  0  0  0  0  0 -2 11 -2  1  0
 B9       7   2    -1  1  0  0  0  0  0  0  0 -2 11 -2  1
 B10      9   3     2 -1  0  1  0  0  0  0  0  1 -2 11 -2
 B11      6   2     0  1  0 -2  0  0  0  0  0  0  1 -2 11
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
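
The rules for the matrix entries and for the two summary numbers can be sketched in Python as follows. This is a reading of the description above, not DEJAVU's source code. In particular, it is an assumption that a pair which is neither parallel nor anti-parallel gets 0 even if the SSEs are close together, and that the diagonal element (printed as 11 in the example) is left out of both sums. All names are hypothetical.

```python
def topology_entry(cosine, distance, cos_cutoff=0.8, dist_cutoff=8.0):
    """Return +2, +1, 0, -1 or -2 for one pair of SSEs."""
    neighbours = distance <= dist_cutoff
    if cosine >= cos_cutoff:        # parallel
        return 2 if neighbours else 1
    if cosine <= -cos_cutoff:       # anti-parallel
        return -2 if neighbours else -1
    return 0                        # assumption: neither, so no entry


def row_summary(row, own_index):
    """Return (sum of absolute entries, nr of spatial neighbours) for one SSE."""
    others = [value for i, value in enumerate(row) if i != own_index]
    total = sum(abs(value) for value in others)
    n_neighbours = sum(1 for value in others if abs(value) == 2)
    return total, n_neighbours


if __name__ == "__main__":
    # Row for A1 copied from the example output above (A1 is element 0).
    a1 = [11, -1, 0, 0, 0, 0, 0, 0, 0, 1, -1, 2, 0]
    print(row_summary(a1, 0))
```

Applied to the rows printed above, row_summary reproduces the two numbers that follow each SSE name, e.g. 5 and 1 for A1, and 9 and 3 for B10.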

19 INSTALLING THE SOFTWARE

The system manager will have to do the following:

* put the appropriate executables in directories which are accessible by local DEJAVU users

* change the "make_sse" script (site-specific executables)

* copy the big PDB-derived libraries to an accessible directory

* change the file names of ALL PDB files mentioned in the big PDB-derived libraries so that they point to the disk etc. where you keep your local copies of the uncompressed PDB files. In Uppsala, all PDB files are in a directory called /nfs/pdb/full. If you keep your PDB files in a directory called /usr/mnt/people/pdb, change the big library file accordingly, e.g., using a (stream) editor, OR make a soft link in "/", as follows: ln -s /usr/mnt/people/pdb /nfs/pdb/full
If you create a soft link, you do NOT have to edit the big library file !
Example of changing the libraries with "sed":

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 echo "s%/nfs/pdb/full%/y/database/brookhaven/pdb%g" > q.sed
 sed -f q.sed full_pdb.lib > q ; mv q full_pdb.lib
 echo "s%/nfs/pdb/pre%/y/database/brookhaven/pdb%g" > q.sed
 sed -f q.sed pre_pdb.lib > q ; mv q pre_pdb.lib
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Change the CHAIN card at the bottom of all lib files !

* provide users with a minimalist DEJAVU library file which should AT LEAST contain the following lines:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
TYPE 'ALPHA' 'alpha helix'
TYPE 'BETA' 'beta strand'

CHAIN your_local_big_pdb-derived_dejavu_library_file
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In between the TYPE and the CHAIN commands, the user may insert SSE records of his/her own structures (see the example dejavu_user.lib file). NOTE that keywords should be left-justified, uppercase strings of SIX characters (i.e., add trailing spaces if necessary).
NOTE that you may "chain" an unlimited number of SSE files; I like to have my personal file first, then a file with structures solved in Uppsala but not yet in the PDB and finally the big PDB-derived library.

20 SYMBOLIC MATCHING

As of version 5.3, DEJAVU is capable of "symbolic matching". In this case, the spatial information regarding the SSEs is completely ignored, and only their type and length (nr of residues) are used (as well as the number of residues in gaps between neighbouring SSEs).
This option can be useful if you get no hits at all; for example, a domain rearrangement may screw up coordinate-based searches, but symbolic matching may still work.
Another application is when you have a very reliable secondary structure prediction, but no structure (yet). Make an SSE file and use dummy coordinates, e.g.:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
MOL    P2
NOTE   P2 myelin protein for testing symbolic matching
BETA   'B1'  'A7'   'A9'    3  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B2'  'A12'  'A14'   3  0.0 0.0 0.0 1.0 1.0 1.0
ALPHA  'A1'  'A16'  'A23'   8  0.0 0.0 0.0 1.0 1.0 1.0
ALPHA  'A2'  'A27'  'A35'   9  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B3'  'A37'  'A45'   9  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B4'  'A48'  'A55'   8  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B5'  'A58'  'A64'   7  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B6'  'A68'  'A74'   7  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B7'  'A78'  'A87'  10  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B8'  'A90'  'A97'   8  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B9'  'A100' 'A109' 10  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B10' 'A112' 'A119'  8  0.0 0.0 0.0 1.0 1.0 1.0
BETA   'B11' 'A122' 'A129'  8  0.0 0.0 0.0 1.0 1.0 1.0
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
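
If you start from a secondary structure prediction, a file like the one above can be generated rather than typed. The following Python sketch is one way of doing that. It assumes that the fields after the six-character keyword are free-format (separated by blanks), that helices are named A1, A2, ... and strands B1, B2, ... as in the example, and that the residue count is last - first + 1. The function name and its arguments are hypothetical.

```python
def dummy_sse_file(mol, note, elements, chain="A"):
    """Return the text of an SSE file with dummy coordinates.

    elements: list of (kind, first_residue, last_residue) tuples,
    where kind is 'ALPHA' or 'BETA'.
    """
    lines = [f"{'MOL':<6} {mol}", f"{'NOTE':<6} {note}"]
    counts = {"ALPHA": 0, "BETA": 0}
    for kind, first, last in elements:
        counts[kind] += 1
        name = f"{kind[0]}{counts[kind]}"   # A1, A2, ... or B1, B2, ...
        length = last - first + 1           # nr of residues in the SSE
        lines.append(
            f"{kind:<6} '{name}' '{chain}{first}' '{chain}{last}' {length} "
            "0.0 0.0 0.0 1.0 1.0 1.0"       # same dummy coordinates as above
        )
    lines.append("ENDMOL")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    prediction = [("BETA", 7, 9), ("BETA", 12, 14), ("ALPHA", 16, 23)]
    print(dummy_sse_file("P2", "P2 myelin protein (dummy coordinates)", prediction), end="")
```

Check the generated file against the example above before using it; the column spacing differs, which is harmless only if the free-format assumption holds.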

Now run DEJAVU (see below). Note that 11 of the first 12 hits are proteins that belong to the same family (and have the same fold) as P2 myelin protein.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ********** NEW QUERY **********
   
 Elements : ( B1 B2 A1 A2 B3 B4 B5 B6 B7 B8 B9 B10 B11)
 Nr of SSEs : (      13)
 Min nr of residues for SSEs             ? (       4)
 Nr of SSEs : (      11)
 Remaining SSEs : ( A1 A2 B3 B4 B5 B6 B7 B8 B9 B10 B11)
 Min nr of elements to match (0 = abort) ? (       9)
   
 Is this a BONES search ? (N)
   
 Is this a SYMBOLIC search ? (Y)
   
 SYMBOLIC search; no LSQ done
   
 Define how much the nr of residues in SSEs may differ
 by defining how many residues shorter or longer SSEs in
 the database may be compared to those in your protein.
 Max nr of residues "too short" ? (          3)
 Max nr of residues "too long"  ? (          3)
   
 [...]
   
 ********** 1opb       **********   1243 **********
 [cellular retinol binding protein ii (holo form) (holo-crbpii - rat (r ]
 [/nfs/pdb/full/1opb.pdb                                                ]
 Elements :    A1     A2     B3     B4     B5     B6     B7     B8     B9     B10
 B11
 Nr of common SSEs : (      10)
 Elements :    A1     A2     B3     B4     B5     B6     -X-    B7     B8     B9
 B10
 Total mismatched residues : (       9)
 Total gaps mismatch       : (       7)
 Elements :    A1     A2     B3     B4     B5     B6     -X-    B8     B9     B10
 B11
 Total mismatched residues : (       6)
 Total gaps mismatch       : (       5)
 Elements :    A1     A2     B3     B4     B5     -X-    B6     B7     B8     B9
 B10
 Total mismatched residues : (      10)
 Total gaps mismatch       : (      12)
 Elements :    A1     A2     B3     B4     -X-    B5     B6     B7     B8     B9
 B10
 Total mismatched residues : (      10)
 Total gaps mismatch       : (      12)
 Elements :    A1     A2     B3     -X-    B4     B5     B6     B7     B8     B9
 B10
 Total mismatched residues : (      11)
 Total gaps mismatch       : (      13)
 Elements :    A1     A2     -X-    B3     B4     B5     B6     B7     B8     B9
 B10
 Total mismatched residues : (      12)
 Total gaps mismatch       : (      12)
   
 Nr of hits        : (       6)
 Nr of common SSEs : (      10)
 Nr of best match  : (       2)
 Best score        : (   6.000)
 Best gap mismatch : (   5.000)
   
 [...]
   
 Nr of database entries : (       2182)
 Nr of selected entries : (       2182)
 Nr of matching entries : (         39)
 Nr of hits (total)     : (        639)
   
 Sorting hits ...
   
   Nr Entry  PDB  SSE  GAPS SCORE Compound
 ==== ===== ==== ==== ===== ===== ========
    1  1327 1pmp   11     0     0 p2 myelin protein (p2) - bovine (bos taurus) caudal spinal root myeli
    2   675 1ftp   11     3     2 fatty-acid-binding protein - desert locust (schistocerca gregaria)
    3   545 1eal   11     5    10 nmr study of ileal lipid binding protein - organism_scientific: sus s
    4   440 1crb   11    11     9 cellular retinol binding protein (crbp) complexed with all-t - rat (r
    5   823 1hmt   10     1     1 fatty acid binding protein (human muscle, m-fabp) complexed - organis
    6  1036 1lid   10     1     1 adipocyte lipid-binding protein complexed with oleic acid - mouse (mu
    7  1029 1lfo   10     1     4 liver fatty acid binding protein - oleate complex - organism_scientif
    8  1243 1opb   10     5     6 cellular retinol binding protein ii (holo form) (holo-crbpii - rat (r
    9   635 1fie   10    23    12 recombinant human coagulation factor xiii - organism_scientific: homo
   10   353 1cbi    9     4     5 apo-cellular retinoic acid binding protein i - organism_scientific: m
   11   355 1cbs    9     5     5 cellular retinoic-acid-binding protein type ii complexed wit - human
   12  1105 1mdc    9     5     7 fatty acid binding protein (manduca sexta) (mfb2) - tobacco hornworm
   13  1193 1nir    9     7     7 oxydized nitrite reductase from pseudomonas aeruginosa - organism_sci
   14  2018 2tbv    9     7    13 tomato bushy stunt virus - tomato bushy stunt virus
   
 [...]
   
   37   592 1esf    9    28    17 staphylococcal enterotoxin a - organism_scientific: staphylococcus au
   38   934 1ivd    9    45    12 influenza a subtype n2 neuraminidase (sialidase) (e.c.3.2.1. - influe
   39  1831 2bpa    9  1823    14 bacteriophage phix174 capsid proteins gpf, gpg, gpj and four - bacter
   
  2 CPU total/user/sys :       6.9       6.7       0.2
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
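
The idea behind symbolic matching can be illustrated with a few lines of Python. This is a simplified sketch and not the algorithm used by DEJAVU: it only tests whether one database SSE is compatible with one query SSE (same type, length within the "too short" / "too long" limits) and adds up length differences for a set of paired SSEs. Gaps between SSEs are ignored here, and equating the sum of absolute length differences with the "Total mismatched residues" in the log is an assumption. All names are hypothetical.

```python
def sse_compatible(query, candidate, max_short=3, max_long=3):
    """query and candidate are (type, nr_of_residues) tuples."""
    q_type, q_len = query
    c_type, c_len = candidate
    if q_type != c_type:
        return False
    # The database SSE may be at most max_short residues shorter and
    # at most max_long residues longer than the query SSE.
    return -max_short <= c_len - q_len <= max_long


def residue_mismatch(pairs):
    """Sum of absolute length differences over (query, candidate) pairs."""
    return sum(abs(c_len - q_len) for (_, q_len), (_, c_len) in pairs)


if __name__ == "__main__":
    print(sse_compatible(("BETA", 9), ("BETA", 7)))    # 2 residues too short
    print(sse_compatible(("BETA", 9), ("BETA", 5)))    # 4 residues too short
    print(sse_compatible(("ALPHA", 8), ("BETA", 8)))   # different type
```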

21 RELEASE NOTES

* 930125 - new distance options I (= min of all other types of distances) and A (= max of ditto)

* 930125 - names of SSEs are now all converted to upper case, i.e., no longer case-sensitive

* 930125 - implemented incremental search, i.e. a search for the maximum common motif of your protein and all of the database proteins; the input is the same as for the FIND option, except that you don't provide a set of SSEs but only the minimum number of SSEs that must be matched. This type of search may take a while if your protein contains many SSEs ! Note that you may also specify a minimum length (in residues) which will affect the choice of the query elements and of those from the database structures. Set the minimum length to 5 residues, for example, in order to ignore hits involving tiny SSEs.

* 930125 - implemented option to tell DEJAVU to try and avoid multiple chain hits by using only SSEs which have the same chain identifier for their first residue (in the range 'a' - 'z' or 'A' to 'Z') as the first SSE of each database protein

* 930222 - SELECT option (see above); option to try and avoid hits with multiple copies of the same protein (i.e., if you found a hit with 1LYZ, DEJAVU will skip 2LYZ etc.). It compares the last three characters of the PDB code with those of all proteins that already yielded hits; if they are identical, the protein is skipped (this is not 100 % fail-proof and you might miss interesting hits !!!)
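
The two "avoid" options described in these notes are simple enough to sketch in Python. The code below follows the wording of the notes, not the DEJAVU source. The function names and the (name, first residue) representation of an SSE are hypothetical, and what happens when the first SSE has no chain letter is an assumption (nothing is filtered).

```python
def chain_letter(residue_name):
    """Return the leading chain letter of a residue name such as 'A7', or None."""
    first = residue_name[:1]
    return first if first.isalpha() else None


def same_chain_sses(sses):
    """Keep the SSEs whose first residue has the same chain letter as the first SSE.

    sses: list of (sse_name, first_residue_name) tuples.
    """
    if not sses:
        return []
    reference = chain_letter(sses[0][1])
    if reference is None:
        return list(sses)  # assumption: no chain letter, so nothing to filter
    return [sse for sse in sses if chain_letter(sse[1]) == reference]


def skip_as_identical(pdb_code, codes_with_hits):
    """True if the last three characters match those of a protein that already gave a hit."""
    tail = pdb_code[-3:].lower()
    return any(tail == seen[-3:].lower() for seen in codes_with_hits)


if __name__ == "__main__":
    print(skip_as_identical("2LYZ", ["1lyz"]))   # skipped, as in the note
    print(same_chain_sses([("A1", "A5"), ("B1", "A20"), ("A2", "B7")]))
```

The second function also shows why the rule is not fail-proof: any two entries that share their last three characters are treated as the same protein, whether or not they really are.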

22 KNOWN BUGS

None, at present.

Created at Fri Dec 18 19:42:09 1998 by MAN2HTML version 971024/1.6