Uppsala Software Factory

Uppsala Software Factory - SPASM Manual

1 SPASM - GENERAL INFORMATION
2 REFERENCES
3 VERSION HISTORY
4 INTRODUCTION
5 DATABASE
6 CUSTOM DATABASES (MKSPAZ)
7 SEARCH-PATTERN INPUT FILE
8 RUNNING SPASM
8.1 startup
8.2 database file
8.3 search-pattern file
8.4 identifier
8.5 cut-off values
8.6 sequence
8.7 output
8.8 main chain and/or side chain
8.9 O macro and datablock files
8.10 LSQMAN input files
9 OUTPUT
10 LOOKING FOR LOOPS
11 INTERFACE TO O
12 NON-LOCAL SIMILARITIES
13 LOCAL PDB DIRECTORY STRUCTURE
14 KNOWN BUGS

1 SPASM - GENERAL INFORMATION

Program : SPASM
Version : 990301
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 590, SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : detection of main and side chain motifs
Package : SPASM

2 REFERENCES

Reference(s) for this program:

* 1 * M. Harel, G.J. Kleywegt, R.B.G. Ravelli, I. Silman & J.L. Sussman (1995). Crystal structure of an acetylcholinesterase-fasciculin complex: interaction of a three-fingered toxin from snake venom with its target. Structure 3, 1355-1366. [http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=8747462&form=6&db=m&Dopt=r]

* 2 * G.J. Kleywegt (1998). Deja-vu all over again. CCP4/ESF-EACBM Newsletter on Protein Crystallography 35, July 1998, pp. 10-12. [http://alpha2.bmc.uu.se/usf/factory_9.html]

* 3 * G.J. Kleywegt & T.A. Jones (1998). Databases in protein crystallography. Acta Cryst D54, 1119-1131. [http://alpha2.bmc.uu.se/~gerard/papers/databases.html] [http://www.iucr.org/iucr-top/journals/acta/tocs/actad/1998/actad5406_1.html]

* 4 * G.J. Kleywegt (1998 ?). Recognition of spatial motifs in protein structures. J Mol Biol, in press.

* 5 * G.J. Kleywegt & T.A. Jones (1999 ?). Chapter 25.2.6. O and associated programs. Int. Tables for Crystallography, Volume F. To be published.

3 VERSION HISTORY

950109 - 0.1 - trial quick-n-dirty program -> works well !
950110 - 0.2 - more bells and whistles; select MC and/or SC to use for superpositioning
950111 - 0.3 - continue; interface to O
950112 - 1.0 - first version for the "general public" (Uppsala only)
950118 - 1.1 - sensitive to environment variable GKLIB; debugged option to use generic residue type "XXX" (main-chain only !); added sketch instructions for the search pattern object to the O macro
950119 - 1.2 - introduced option to conserve neighbours
950120 - 1.3 - minor changes
950126 - - added MKSPAZ (v. 1.0) to generate new library entries
950421 - 2.0 - redid database to use CAs instead of main-chain centre-of-gravity to get better results with NHANCE when generating coordinates for beta-strands; changes propagated through all programs
951005 - 2.1 - new database (Hobohm & Sander 95% list); redimensioned for 1200 residues
970124 - 2.2 - optional generation of LSQMAN input file (to detect more global similarities)
971127 - 3.0 - implemented use of BLOSUM-45 substitution matrix to decide which substitutions are allowed; optional generation of multiple sequence alignment file for use with MSEQPRO for profile analysis
980210 -3.0.1- increased dimensioning so the program can handle 1OCC
980318 -3.0.2- minor bug fix (first residue of "Your sequence" was usually the wrong one ;-)
980909 - 3.1 - added a new substitution option in which the user can define which residue-type substitutions will be allowed (e.g., HIS<->HIS, but GLN<->GLU,ASP,GLN)
981007 - X - the jiffy program DEJANA (part of the DEJAVU package) has been changed so it can also be used with O macros produced by SPASM and RIGOR !
990301 - 3.2 - separate distance cut-offs for CA/CA and sidechain/SC mismatches

4 INTRODUCTION

SPASM stands for "SPatial Arrangements of Side chains and Main chains". It is a complementary program to DEJAVU: DEJAVU can be used to find similar arrangements of helices and strands, and SPASM can be used to find similar arrangements of side chains and main chains (e.g., loops, turns, active sites, metal-binding sites, etc.).
The program is based on an idea of Artymiuk et al.; reference: P. Artymiuk, "Fold Recognition", in "Making the Most of Your Model" (J.N. Thornton & W.N. Hunter, Eds.), Daresbury Laboratory, pp. XXX-YYY (1995).
The algorithm is basically the same as that used by DEJAVU (i.e., an exhaustive, recursive, depth-first search with early pruning of the search tree), but the input and the database differ, of course.
The program is surprisingly fast, and surprisingly little noise tends to be generated (unless you have relaxed criteria). SPASM is interfaced with O, since it can produce an O macro to read, draw and align all hits to your search pattern.
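To make the phrase "exhaustive, recursive, depth-first search with early pruning" concrete, here is a minimal sketch in Python. It is not SPASM's actual code: it uses one pseudo-atom per residue, requires identical residue types, and omits the final superpositioning/RMSD step. The function name `find_matches` and the toy coordinates are invented for illustration.

```python
from math import dist

def find_matches(pattern, protein, max_mismatch=2.0):
    """pattern, protein: lists of (residue_type, (x, y, z)) pseudo-atoms.

    Returns tuples of protein indices, one index per pattern residue.
    """
    hits = []

    def extend(assigned):
        k = len(assigned)                  # next pattern residue to place
        if k == len(pattern):
            hits.append(tuple(assigned))
            return
        p_type, p_xyz = pattern[k]
        for j, (r_type, r_xyz) in enumerate(protein):
            if r_type != p_type or j in assigned:
                continue
            # Early pruning: the distance to every residue placed so far
            # must agree with the pattern to within max_mismatch.
            if all(abs(dist(p_xyz, pattern[i][1]) - dist(r_xyz, protein[a][1]))
                   <= max_mismatch for i, a in enumerate(assigned)):
                assigned.append(j)
                extend(assigned)           # go one level deeper
                assigned.pop()             # backtrack

    extend([])
    return hits

# Toy data (hypothetical): a Ser-Asp-His triangle hidden among decoys.
pattern = [("SER", (0, 0, 0)), ("ASP", (8, 0, 0)), ("HIS", (4, 3, 0))]
protein = [("ASP", (50, 50, 50)), ("SER", (10, 10, 10)), ("HIS", (14, 13, 10)),
           ("ASP", (18, 10, 10)), ("HIS", (30, 10, 10))]
print(find_matches(pattern, protein))      # [(1, 3, 2)]
```

Because a branch is abandoned as soon as one distance disagrees, most of the search tree is never visited, which is presumably why little time is spent on proteins that contain no similar arrangement.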
The accompanying program MKSPAZ can be used to generate new library entries from standard PDB files. This can be used to include new proteins in the library, or to generate small libraries which contain only those proteins you want to compare your structure with.

NOTE: This program is sensitive to the environment variable GKLIB. If set, the name of this directory will be prepended to the default name for the library file needed by this program. For example, in Uppsala, put the following line in your .login or .cshrc file: setenv GKLIB /nfs/public/lib
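The effect of GKLIB on the default library name can be pictured as follows. This is only a sketch of the behaviour described in the note; the function name `default_library` is invented, and the way SPASM itself joins the directory and file name is an assumption.

```python
import os
import posixpath

def default_library(environ=os.environ, name="spasm.lib"):
    """Return the default library file name, honouring GKLIB if it is set."""
    gklib = environ.get("GKLIB")
    # Assumption: the directory is simply prepended to the default name.
    return posixpath.join(gklib, name) if gklib else name

print(default_library({"GKLIB": "/nfs/public/lib"}))  # /nfs/public/lib/spasm.lib
print(default_library({}))                            # spasm.lib
```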

5 DATABASE

The database (spasm.lib) contains information about the CA atoms and the centres-of-gravity of the side-chain atoms of > 200,000 residues from ~950 proteins (Hobohm & Sander 95% homology list of August 1995).
The format of the database is simple, and you can easily add more proteins to it (with MKSPAZ). An entry for a protein may look as follows:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Created by MAKEDB V. 950421/2.0 at Fri Oct 6 00:30:06 1995 for user gerard
!
PRO 125D
PDB /nfs/pdb/full/125d.pdb
RES  99.99
CMP CD2-GAL4 (65-RESIDUE DNA-BINDING DOMAIN) (YEAST) (NMR, 22 STRUCTURES)
MET     1  2.427 -14.350 -17.374 -0.570 -14.374 -18.601
LYS     2  4.409 -11.142 -17.694 3.077 -8.545 -19.102
...
PRO    42  3.389 -0.744 -14.643 1.531 -0.759 -14.821
LYS    43  3.879 -0.178 -18.355 2.235 0.052 -21.732
END
!
PRO 135L
PDB /nfs/pdb/full/135l.pdb
RES   1.30
CMP LYSOZYME (E.C.3.2.1.17)
LYS     1  25.408 20.195 26.922 27.543 17.709 27.832
VAL     2  23.949 17.686 24.540 23.390 18.223 22.658
...
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Any line starting with an exclamation mark ("!") is ignored as being a comment line. A protein entry starts with "PRO XXXX", where "XXXX" is an identifier (e.g., PDB code). The physical location of the protein's PDB file is stored in the PDB record, the resolution (in A; 99.99 for NMR structures and other files without a resolution remark) on the RES record, and the name of the protein in the CMP record. These first four records *must* appear in this order, but comment lines may be interspersed.
For each residue there is one line which contains the residue identifier in columns 1-10 (columns 18-27 of a regular PDB record), followed by the coordinates of the CA atom and the centre-of-gravity of the side-chain atoms in the order CaX, CaY, CaZ, SX, SY, SZ (can be read in free format). The "END" record signals the end of the residue list.
In principle, you can add as many proteins to the database as you wish, since the file is rewound, read and processed simultaneously. This means that the database residues are not stored in memory. There is a hard-wired limitation on the maximum number of residues in any given individual database protein, though (at present, 1200 residues).
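If you want to inspect or filter a library file yourself, the format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the SPASM distribution; the name `read_spasm_library` and the dictionary keys are invented. Note that "PRO" is both the record that opens an entry and the residue type proline, so the reader has to keep track of whether it is inside a residue list. Entries are yielded one at a time, so the whole library never has to be held in memory.

```python
def read_spasm_library(lines):
    """Yield one dict per protein entry from an iterable of library lines."""
    entry, in_residues = None, False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("!"):
            continue                                  # blank or comment line
        if entry is None:
            if line.startswith("PRO "):               # start of a new entry
                entry, in_residues = {"id": line[4:].strip(), "residues": []}, False
            continue
        if not in_residues:                           # header records
            key, value = line[:3], line[3:].strip()
            if key == "PDB":
                entry["pdb"] = value
            elif key == "RES":
                entry["resolution"] = float(value)
            elif key == "CMP":
                entry["compound"] = value
                in_residues = True                    # residues follow CMP
            continue
        if line.strip() == "END":
            yield entry
            entry = None
            continue
        resid = line[:10]                             # columns 1-10
        xyz = [float(v) for v in line[10:].split()]   # free format
        entry["residues"].append((resid, tuple(xyz[:3]), tuple(xyz[3:6])))

# A few lines taken from the 125D example entry shown above.
SAMPLE = [
    "! comment",
    "PRO 125D",
    "PDB /nfs/pdb/full/125d.pdb",
    "RES  99.99",
    "CMP CD2-GAL4 (65-RESIDUE DNA-BINDING DOMAIN) (YEAST) (NMR, 22 STRUCTURES)",
    "MET     1  2.427 -14.350 -17.374 -0.570 -14.374 -18.601",
    "PRO    42  3.389 -0.744 -14.643 1.531 -0.759 -14.821",
    "END",
]
for protein in read_spasm_library(SAMPLE):
    print(protein["id"], protein["resolution"], len(protein["residues"]))
```

With a real file you would pass the open file object instead of `SAMPLE`.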

6 CUSTOM DATABASES (MKSPAZ)

If you want to add new structures to the database, or if you want to generate a small database which contains only those structures to which you want to compare your protein, use the accompanying program MKSPAZ.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ...
 Name of new SPASM library to create ? (spasm.new) spasm.custom
   
 Name of next PDB file to add ? ( ) /nfs/pdb/full/1aaz.pdb
 4-Character ID ? (/nfs) 1aaz
   
 Processing : (/nfs/pdb/full/1aaz.pdb)
 Nr of atoms : (        862)
   
 Nr of residues found   : (         87)
 Nr of residues written : (         87)
   
 Name of next PDB file to add ? ( ) /nfs/pdb/full/3cbh.pdb
 4-Character ID ? (/nfs) 3cbh
   
 Processing : (/nfs/pdb/full/3cbh.pdb)
 Nr of atoms : (        365)
   
 Nr of residues found   : (        365)
 Nr of residues written : (          0)
   
 Name of next PDB file to add ? ( ) chra.pdb
 4-Character ID ? (chra)
   
 Processing : (chra.pdb)
 Nr of atoms : (       2794)
 Resolution (A; 99.99 for NMR) ? (  99.990) 3.0
   
 Nr of residues found   : (        370)
 Nr of residues written : (        370)
   
 Name of next PDB file to add ? ( )
   
 Nr of residues total : (        457)
 Nr of proteins used  : (          2)
 ...
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

You can add as many PDB files in one run of the program as you like. Note that 3CBH only contains CA coordinates and can therefore not be used.
The new database may look as follows:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Created by MKSPAZ V. 950421/2.0 at Fri Apr 21 23:04:43 1995 for user gerard
!
PRO 1AAZ
PDB /nfs/pdb/full/1aaz.pdb
RES   2.00
CMP GLUTAREDOXIN
MET A   1  19.791 29.971 -7.982 20.633 31.166 -7.396
PHE A   2  21.192 26.459 -8.288 19.118 24.939 -9.912
...
LYS A  87  11.780 15.553 -16.595 13.488 14.683 -19.464
END
!
PRO 3CBH
PDB /nfs/pdb/full/3cbh.pdb
RES   2.00
CMP CELLOBIOHYDROLASE /II$ CORE PROTEIN (E.C.3.2.1.91) (/CBHII$)
END
! NO residues
!
PRO CHRA
PDB chra.pdb
RES   3.00
CMP
MET A   1  12.446 27.113 66.765 12.791 23.902 65.990
LYS A   2  10.581 29.825 64.962 7.858 30.460 62.815
ILE A   3  9.320 32.550 67.058 10.543 34.217 67.524
...
VAL A 369  6.275 51.912 83.789 5.932 53.474 82.387
SER A 370  9.658 52.747 85.454 9.533 54.706 84.938
END
!
! total residues    457
! total proteins      2
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

7 SEARCH-PATTERN INPUT FILE

Defining a search pattern is very simple: just make a PDB file and remove everything *except* all atoms of the residues of interest.
For example, if your catalytic residues are Asp 123, Glu 219 and Asp 382, simply make a PDB file which only contains the atoms of each of these three residues. SPASM *implicitly* assumes that you present the residues in the order in which they appear in the sequence ! This is sometimes important, but the program does *not* check if this is actually the case !
As an example, the following file contains the residues of the catalytic triad of Candida antarctica lipase B (PDB code 1TCA):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
ATOM    768  N   SER   105      -8.424  22.313  13.475  1.00  3.99      1TCA 906
ATOM    769  CA  SER   105      -8.136  21.862  14.832  1.00  5.02      1TCA 907
ATOM    770  C   SER   105      -9.268  20.966  15.394  1.00  3.52      1TCA 908
ATOM    771  O   SER   105      -9.669  20.008  14.737  1.00  3.42      1TCA 909
ATOM    772  CB  SER   105      -7.904  23.111  15.702  1.00  8.11      1TCA 910
ATOM    773  OG  SER   105      -7.320  22.766  16.938  1.00 13.88      1TCA 911
ATOM   1369  N   ASP   187       3.721  21.285  13.689  1.00  8.01      1TCA1507
ATOM   1370  CA  ASP   187       2.590  21.835  14.434  1.00  6.80      1TCA1508
ATOM   1371  C   ASP   187       3.008  21.995  15.906  1.00  6.78      1TCA1509
ATOM   1372  O   ASP   187       3.491  21.052  16.516  1.00  7.57      1TCA1510
ATOM   1373  CB  ASP   187       1.399  20.880  14.322  1.00  5.92      1TCA1511
ATOM   1374  CG  ASP   187       0.083  21.509  14.737  1.00  7.68      1TCA1512
ATOM   1375  OD1 ASP   187       0.020  22.124  15.816  1.00  6.59      1TCA1513
ATOM   1376  OD2 ASP   187      -0.895  21.386  13.979  1.00  7.36      1TCA1514
ATOM   1649  N   HIS   224       0.477  25.559  13.397  1.00  6.46      1TCA1787
ATOM   1650  CA  HIS   224      -0.921  25.162  13.569  1.00  6.60      1TCA1788
ATOM   1651  C   HIS   224      -1.880  26.075  12.788  1.00  6.98      1TCA1789
ATOM   1652  O   HIS   224      -2.807  25.591  12.123  1.00  7.10      1TCA1790
ATOM   1653  CB  HIS   224      -1.273  25.180  15.058  1.00  6.87      1TCA1791
ATOM   1654  CG  HIS   224      -2.570  24.513  15.386  1.00  7.40      1TCA1792
ATOM   1655  ND1 HIS   224      -2.851  23.212  15.037  1.00  7.13      1TCA1793
ATOM   1656  CD2 HIS   224      -3.666  24.948  16.056  1.00  7.87      1TCA1794
ATOM   1657  CE1 HIS   224      -4.041  22.847  15.458  1.00  8.67      1TCA1795
ATOM   1658  NE2 HIS   224      -4.541  23.888  16.073  1.00  8.28      1TCA1796
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The maximum number of atoms and residues in the search pattern is printed upon start-up. Any lines which do not start with "ATOM" are skipped; hydrogen atoms are ignored. Amino-acid residues are recognised by the fact that they have more than three main-chain atoms (N, CA, C, O, OTX, OT1, OT2); for example, the database contains a handful of pyroglutamate residues (type PCA). If you use a residue type "XXX", the residue will be matched against *any* residue type in the database.
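The reduction of each residue to two pseudo-atoms (the CA atom and the centre-of-gravity of the side-chain atoms) can be sketched as follows. This is an illustration, not the program's own code: the function name `pseudo_atoms` is invented, the hydrogen test is a simplification, the centre-of-gravity is taken to be the unweighted mean, and the "more than three main-chain atoms" check is left out. Residues without side-chain atoms (glycine) fall back on the CA position, which is what the glycine lines in the loop example of section 10 suggest.

```python
MAIN_CHAIN = {"N", "CA", "C", "O", "OTX", "OT1", "OT2"}

def pseudo_atoms(pdb_lines):
    """Map residue id (PDB columns 18-27) to (CA xyz, side-chain centre xyz)."""
    atoms = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                                   # skip non-ATOM lines
        name = line[12:16].strip()
        if name.lstrip("0123456789").startswith("H"):
            continue                                   # assumed hydrogen test
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.setdefault(line[17:27], {})[name] = xyz

    result = {}
    for resid, by_name in atoms.items():
        if "CA" not in by_name:
            continue
        side = [xyz for name, xyz in by_name.items() if name not in MAIN_CHAIN]
        if side:
            centre = tuple(sum(axis) / len(side) for axis in zip(*side))
        else:
            centre = by_name["CA"]                     # e.g. glycine
        result[resid] = (by_name["CA"], centre)
    return result

# Ser 105 from the 1TCA pattern file shown above.
SER_105 = [
    "ATOM    768  N   SER   105      -8.424  22.313  13.475  1.00  3.99      1TCA 906",
    "ATOM    769  CA  SER   105      -8.136  21.862  14.832  1.00  5.02      1TCA 907",
    "ATOM    770  C   SER   105      -9.268  20.966  15.394  1.00  3.52      1TCA 908",
    "ATOM    771  O   SER   105      -9.669  20.008  14.737  1.00  3.42      1TCA 909",
    "ATOM    772  CB  SER   105      -7.904  23.111  15.702  1.00  8.11      1TCA 910",
    "ATOM    773  OG  SER   105      -7.320  22.766  16.938  1.00 13.88      1TCA 911",
]
for resid, (ca, sc) in pseudo_atoms(SER_105).items():
    print(resid, ca, sc)
```

For this residue the side-chain centre is the mean of CB and OG, which agrees to within rounding with the values SPASM echoes for SER 105 when it reads the pattern file (section 8.3).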

8 RUNNING SPASM

SPASM is easy to use. One thing you need to know in advance is the location of the database file on your local computer system. In Uppsala, this is: /nfs/public/lib/spasm.lib
Once you start the program, just answer the questions (most of the defaults make sense), and let SPASM do the hard work. To explain the input etc., we shall work through an example using the pattern file shown above (Ser-Asp-His):

8.1 startup

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM ***
   
 Version - 950421/2.0
 (C) 1993-5 Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
 Others - T.A. Jones, G. Bricogne, Rams, W.A. Hendrickson
 Others - W. Kabsch, CCP4, PROTEIN, E. Dodson, etc. etc.
   
 Started - Fri Apr 21 23:07:22 1995
 User - gerard
 Mode - interactive
 Host - jupiter
 ProcID - 19870
 Tty - /dev/ttyq8
   
 *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM ***
   
 Max nr of atoms in pattern file : ( 500)
 Max nr of residues in ,, ,, : ( 50)
 Ditto, in database proteins : ( 1024)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

8.2 database file

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SPASM database file ? (/nfs/public/lib/spasm.lib) ../spasm.lib
 CPU total/user/sys :       0.0       0.0       0.0
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Provide the name of the database file on your local computer system.

8.3 search-pattern file

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Which PDB file ? (0xyz.pdb) 1tca.pdb
 Nr of atoms : ( 24)
 1TCA SER 105 -8.136 21.862 14.832 -7.612 22.938 16.320
 1TCA ASP 187 2.590 21.835 14.434 0.152 21.475 14.714
 1TCA HIS 224 -0.921 25.162 13.569 -3.157 24.098 15.511
   
 Nr of residues found : ( 3)
 Nr of residues okay : ( 3)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Provide the name of the PDB file that contains your search pattern.

8.4 identifier

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Four-character ID for this run ? (1TCA)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Provide a 4-character ID for this run (used in the O macro, for instance).

8.5 cut-off values

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Enter the max RMSD for "good" hits.  If you use
 only a few residues (3-5), an RMSD < 1 A tends
 to be obtained for similar arrangements of
 residues.
 Max superpositioning RMSD ? (   1.500)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The maximum allowable value of the RMSD of database residues with your search pattern. It is always good to start with a low value (e.g., 1 A) to see if there are any *very* similar patterns in the database. If not, you can relax the value to 1.5 or 2 A and repeat the search.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 To speed up the search, any match in which at
 one of the residue-residue distances differs by
 more than a certain value are not pursued further.
 Reasonable values are 1 - 2 A.
 Max distance mismatch ? (   2.000)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

This number should be a little bit larger than the maximum RMSD; it can be relaxed further if no hits are found with more restrictive values.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may opt to use only structures solved at high
 resolution by supplying a resolution cut-off.
 Note: NMR structures have a resolution of 99.99 A,
 so use a cut-off > 100 if you want to include these.
 Resolution cut-off (A) ? ( 999.900) 2.5
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Supply a resolution cut-off (or a number >= 100 if you don't want to use such a cut-off).

8.6 sequence

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may opt to allow substitutions of certain
 residue types.  You have the following options:
  (1) Do not allow substitutions
  (2) Only allow D/E, N/Q, L/I, F/Y and R/K
  (3) Use BLOSUM-45 to decide
  (4) User-defined substitutions
 Substitution option ? (       4)
   
 Enter allowed substitutions in 3-letter code:
 Which types to allow for ARG ? (ARG)
 Which types to allow for GLN ? (GLN) gln asp glu
 Which types to allow for HIS ? (HIS)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

You now have several options to decide which substitutions to allow.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If you want to conserve the order in which your
 residues occur in the sequence, use this constraint.
 Conserve sequence directionality ? (N) y
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

This constraint, when used, speeds up the search considerably, but you may not always want to use it.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If you want to conserve neighbouring residues, use
 this constraint.
 Conserve neighbouring residues ? (N) y
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

This constraint is useful when you have multiple loops, helices and/or strands with gaps.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If you want to conserve the sizes of the sequence
 gaps between the residues in your search pattern,
 use this constraint.
 Conserve sequence gaps ? (N)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

This constraint or the previous one *must* be switched on when you do main-chain searches involving sequential residues (loops, turns, etc.). Also, if you have a pattern like "GxxTxN" you may want to conserve the gaps.
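The three sequence constraints can be pictured as simple tests on the residue numbers of a candidate match. The sketch below is an interpretation of the prompts above, not the program's own logic: the function names are invented, residue numbers stand in for sequence positions, and "conserve neighbouring residues" is read as "residues that are sequential in the pattern must also be sequential in the hit".

```python
def order_conserved(hit):
    """Hit residues occur in the same sequence order as the pattern residues."""
    return all(a < b for a, b in zip(hit, hit[1:]))

def neighbours_conserved(pattern, hit):
    """Residues adjacent in the pattern are also adjacent in the hit."""
    return all(hb - ha == 1
               for pa, pb, ha, hb in zip(pattern, pattern[1:], hit, hit[1:])
               if pb - pa == 1)

def gaps_conserved(pattern, hit):
    """All sequence gaps between successive residues are identical."""
    return all(pb - pa == hb - ha
               for pa, pb, ha, hb in zip(pattern, pattern[1:], hit, hit[1:]))

# Catalytic triad of 1TCA against the 1CUS hit shown in section 9.
triad, cutinase = [105, 187, 224], [120, 175, 188]
print(order_conserved(cutinase))                 # True
print(gaps_conserved(triad, cutinase))           # False

# Loop 98-106 of 1CBS against residues 147-155 of 1HNA (section 10).
loop, hit = list(range(98, 107)), list(range(147, 156))
print(neighbours_conserved(loop, hit), gaps_conserved(loop, hit))   # True True
```

This also shows why conserving gaps would be the wrong choice for the triad search: the catalytic residues are spaced differently in the two enzymes.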

8.7 output

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may want to see the MC/MC and/or SC/SC distance
 matrices of your search pattern and that of any hits
 found in the database, to help decide if the hit
 is good enough for your purposes.  Matrices are *only*
 printed if you search pattern contains 10 or fewer
 residues.
 Print distance matrices ? (N) y
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The distance matrices enable you to see how good the fit is, and if there are residues which "fan" more than the others.
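The distance matrices are simply the pairwise distances between the pseudo-atoms. As a sketch (the helper name `distance_matrix` is invented), the following reproduces the side-chain matrix of the search pattern from the centre-of-gravity coordinates that SPASM echoed when it read 1tca.pdb in section 8.3:

```python
from math import dist

def distance_matrix(points):
    """Square matrix of pairwise distances between (x, y, z) points."""
    return [[dist(p, q) for q in points] for p in points]

# Side-chain centres of SER 105, ASP 187 and HIS 224 (from section 8.3).
SC_1TCA = [(-7.612, 22.938, 16.320),
           (0.152, 21.475, 14.714),
           (-3.157, 24.098, 15.511)]

for row in distance_matrix(SC_1TCA):
    print(" ".join(f"{d:5.1f}" for d in row))
#   0.0   8.1   4.7
#   8.1   0.0   4.3
#   4.7   4.3   0.0
```

These are the values printed as the "Target SC distance matrix" in the hit listing of section 9. Comparing them element by element with the matrix of a hit shows which residue pairs deviate most.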

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If you are not an O user, you may want the best
 superpositioning operator to be printed.
 Print operators ? (N)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

If you create O files, you don't need to see the operators.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 For debugging purposes, you may request extensive
 output, listing *all* database proteins which are
 tried.
 Extensive output ? (N)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Not normally used.

8.8 main chain and/or side chain

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may opt to use the centres-of-gravity of the
 side-chain and/or main-chain atoms.  If you have
 few residues in your search pattern (e.g., 3), it
 is best to use both.  If you only use main-chains,
 the residue *types* are ignored.
 0=SC, 1=MC+SC, 2=MC ? (       1) 1
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

If you look for 3-5 active site residues etc., use MC+SC to get a better alignment. If you look for loops etc., use MC only.
If you look for only two residues, MC and SC atoms are always used. Note that you will get a lot of noise (false hits) if you look for only two residues, so use low values for the maximum RMSD and maximum distance difference (e.g., 0.5 and 1.0 A) !

8.9 O macro and datablock files

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may want an O macro plus LSQ operator file for
 easy inspection of the hits. Use this only once you
 have found a proper set of search parameters.
 O macro and operator file ? (N) y
 O macro file ? (1tca.omac)
 O operator file ? (1tca.odb)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

If you have found a promising set of hits, you may want to look at them on the display. In that case, use the option to generate an O macro to do all the hard work for you. This will also generate an O datablock file which contains all the relevant LSQ operators.
The jiffy program DEJANA (part of the DEJAVU package) can be used to sort the hits in the O macro produced by SPASM !

8.10 LSQMAN input files

From version 2.2 onwards, you may opt to get an input file for the least-squares superpositioning program LSQMAN. This input file will read the coordinates, apply the operator found by SPASM, and attempt to extend the superpositioning between your model and each of the hits. This may enable you to detect more global (or: less local) similarities to other structures. In order for this to work, LSQMAN will need the PDB file which contains your complete model (i.e., not just the motif you ran through SPASM).

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 You may want LSQMAN input file to see if the
 superpositioning of the putative hits extends
 beyond the motif you have defined.
 LSQMAN input file ? (Y)
 LSQMAN input file ? (cra2.lsqman)
 PDB file of your entire model ? (cra2.pdb) /nfs/pdb/full/1cbs.pdb
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

9 OUTPUT

Now SPASM takes off. It reads the database proteins one at a time. It then checks to which residues each of the residues in your pattern can be matched. Normally, an "ASP" only matches an "ASP", but there are the following exceptions:

- if you use only main-chain atoms, the sequence is completely ignored, thereby enabling you to find loops etc. (this is not unlike the Lego_loop command in O);

- if you name a residue in your search pattern "XXX" (i.e., instead of ALA, ASN, etc.), it can be matched to *any* residue type;

- if you enabled the substitution option, some residue types can be matched with types other than their own (e.g., Asp-Glu).
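The matching rules above can be summarised in a few lines of Python. This is a sketch only: the function and table names are invented, the BLOSUM-45 option is left out because its threshold is not documented here, and user-defined substitutions are assumed to apply in one direction (from the pattern type to the database type).

```python
# Option 2 of the substitution prompt: D/E, N/Q, L/I, F/Y and R/K.
PAIRS = [("ASP", "GLU"), ("ASN", "GLN"), ("LEU", "ILE"),
         ("PHE", "TYR"), ("ARG", "LYS")]

def conservative_table():
    """Allowed database types per pattern type for the five fixed pairs."""
    table = {}
    for a, b in PAIRS:
        table.setdefault(a, {a}).add(b)
        table.setdefault(b, {b}).add(a)
    return table

def types_match(pattern_type, db_type, main_chain_only=False, allowed=None):
    """Can a pattern residue of pattern_type be matched to db_type?"""
    if main_chain_only or pattern_type == "XXX":
        return True                          # sequence ignored / wildcard
    if allowed and pattern_type in allowed:
        return db_type in allowed[pattern_type]
    return db_type == pattern_type           # default: identical types only

print(types_match("ASP", "GLU"))                                # False
print(types_match("ASP", "GLU", allowed=conservative_table()))  # True
print(types_match("XXX", "TRP"))                                # True
# User-defined example from section 8.6: GLN may match GLN, ASP or GLU.
print(types_match("GLN", "ASP", allowed={"GLN": {"GLN", "ASP", "GLU"}}))  # True
```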

When a "hit" is found, SPASM informs you:

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ...
 Searching ...
   
 ==> HIT  : (1CUS)
 Compound : (CUTINASE (E.C.3.1.1.-))
 File     : (/nfs/pdb/full/1cus.pdb)
 Residues : (        197)
 Resol (A): (   1.250)
   
 MATCH with RMSD   0.48 A for   6 pseudo-atoms
 SER   105  <---> SER   120  *
 ASP   187  <---> ASP   175  *
 HIS   224  <---> HIS   188  *
   
 Target SC distance matrix
 SER   105    0.0   8.1   4.7
 ASP   187    8.1   0.0   4.3
 HIS   224    4.7   4.3   0.0
   
 Hit SC distance matrix
 SER   120    0.0   8.6   5.0
 ASP   175    8.6   0.0   4.7
 HIS   188    5.0   4.7   0.0
   
 Target MC distance matrix
 SER   105    0.0  10.7   8.0
 ASP   187   10.7   0.0   4.9
 HIS   224    8.0   4.9   0.0
   
 Hit MC distance matrix
 SER   120    0.0  10.5   8.2
 ASP   175   10.5   0.0   5.0
 HIS   188    8.2   5.0   0.0
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

It shows the ID, file and compound name of the hit and the number of residues in it. Then it lists how many possible matches there exist for each of the individual residues in your search pattern.
Subsequently, it prints the successful match, including the RMSD. An asterisk ("*") after a matched residue means that the residue type is conserved. If you requested this, the MC/MC and/or SC/SC distance matrices in your search pattern and in the hit are shown.
Finally, the operator which superimposes the database protein with your search pattern is shown (if you don't generate an O macro, you can still use this to quickly superimpose the database protein and your own).

In this example, we find two more hits, both of which make sense (actually, with a bit more relaxed criteria, we find 6 reasonable hits):

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ==> HIT  : (1HPL)
 Compound : (LIPASE (E.C.3.1.1.3) (TRIACYLGLYCEROL HYDROLASE))
 File     : (/nfs/pdb/full/1hpl.pdb)
 Residues : (        449)
 Resol (A): (   2.300)
   
 MATCH with RMSD   1.10 A for   6 pseudo-atoms
 SER   105  <---> SER A 152  *
 ASP   187  <---> ASP A 205  *
 HIS   224  <---> HIS A 263  *
   
 ...
   
 ==> HIT  : (1TCA)
 Compound : (LIPASE (E.C.3.1.1.3) (TRIACYLGLYCEROL HYDROLASE))
 File     : (/nfs/pdb/full/1tca.pdb)
 Residues : (        317)
 Resol (A): (   1.550)
   
 MATCH with RMSD   0.00 A for   6 pseudo-atoms
 SER   105  <---> SER   105  *
 ASP   187  <---> ASP   187  *
 HIS   224  <---> HIS   224  *
   
 Nr of proteins found : (          3)
 Nr of proteins tried : (        472)
 Total number of hits : (          3)
 CPU total/user/sys :      23.6      23.1       0.5
   
 Run again ? (Y) n
   
 *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM ***
   
 Version - 950421/2.0
 Started - Fri Apr 21 23:07:22 1995
 Stopped - Fri Apr 21 23:12:55 1995
   
 CPU-time taken :
 User    -     23.1 Sys    -      0.6 Total   -     23.7
   
 *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM ***
   
 >>> This program (C) 1993-95, GJ Kleywegt & TA Jones <<<
 E-mail: "gerard@xray.bmc.uu.se" or "alwyn@xray.bmc.uu.se"
   
 *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM *** SPASM ***
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

If you want to run the program again, simply reply "Y(es)", and off you go again. By the way: note that the search took less than half a minute (on an SGI XZ; using an older database) !

10 LOOKING FOR LOOPS

As an example of the use of SPASM in locating loops, turns, etc. which are similar to those in your protein, we use residues Lys 98 - Lys 106 of holo-CRABP II (PDB code 1CBS).
The input could be as follows:

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Which PDB file ? (0xyz.pdb) cra5.pdb
 Nr of atoms : ( 67)
 CRA5 LYS 98 3.696 23.952 23.582 0.544 25.481 24.450
 CRA5 LEU 99 4.357 26.721 21.041 5.674 26.130 18.846
 CRA5 LEU 100 2.879 30.070 22.051 4.009 31.225 24.034
 CRA5 LYS 101 2.282 30.961 18.413 3.437 33.815 19.258
 CRA5 GLY 102 2.423 28.993 15.184 2.423 28.993 15.184
 CRA5 GLU 103 2.988 25.348 14.377 0.150 25.311 13.383
 CRA5 GLY 104 5.967 23.024 14.231 5.967 23.024 14.231
 CRA5 PRO 105 7.338 19.640 15.418 8.813 20.317 14.469
 CRA5 LYS 106 6.255 18.453 18.868 4.927 17.885 21.479
   
 Nr of residues found : ( 9)
 Nr of residues okay : ( 9)
   
 Four-character ID for this run ? (CRA5)
   
 Enter the max RMSD for "good" hits.  If you use
 only a few residues (3-5), an RMSD < 1 A tends
 to be obtained for similar arrangements of
 residues.
 Max superpositioning RMSD ? (   1.500) 1.0
   
 To speed up the search, any match in which at
 one of the residue-residue distances differs by
 more than a certain value are not pursued further.
 Reasonable values are 1 - 2 A.
 Max distance mismatch ? (   2.000)
   
 You may opt to use only structures solved at high
 resolution by supplying a resolution cut-off.
 Note: NMR structures have a resolution of 99.99 A,
 so use a cut-off > 100 if you want to include these.
 Resolution cut-off (A) ? ( 999.900)
   
 You may opt to allow substitutions of certain
 residue types.  At present, the following are
 hard-wired: ASP/GLU, ASN/GLN, LEU/ILE, PHE/TYR
 and LYS/ARG.
 Allow for these substitutions ? (N)
   
 If you want to conserve the order in which your
 residues occur in the sequence, use this constraint.
 Conserve sequence directionality ? (N) y
   
 If you want to conserve neighbouring residues, use
 this constraint.
 Conserve neighbouring residues ? (N) y
   
 If you want to conserve the sizes of the sequence
 gaps between the residues in your search pattern,
 use this constraint.
 Conserve sequence gaps ? (N) y
   
 You may want to see the MC/MC and/or SC/SC distance
 matrices of your search pattern and that of any hits
 found in the database, to help decide if the hit
 is good enough for your purposes.  Matrices are *only*
 printed if you search pattern contains 10 or fewer
 residues.
 Print distance matrices ? (N) y
   
 If you are not an O user, you may want the best
 superpositioning operator to be printed.
 Print operators ? (N)
   
 For debugging purposes, you may request extensive
 output, listing *all* database proteins which are
 tried.
 Extensive output ? (N)
   
 You may opt to use the centres-of-gravity of the
 side-chain and/or main-chain atoms.  If you have
 few residues in your search pattern (e.g., 3), it
 is best to use both.  If you only use main-chains,
 the residue *types* are ignored.
 0=SC, 1=MC+SC, 2=MC ? (       1) 2
   
 You may want an O macro plus LSQ operator file for
 easy inspection of the hits. Use this only once you
 have found a proper set of search parameters.
 O macro and operator file ? (N) y
 O macro file ? (cra5.omac)
 O operator file ? (cra5.odb)
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The only hit (with the parameters used above):

----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ...
 Searching ...
   
 ==> HIT  : (1HNA)
 Compound : (GLUTATHIONE S-TRANSFERASE (HUMAN, CLASS MU) (GSTM2-2) FORM A (E.C.2.5.1.18)
 File     : (/nfs/pdb/full/1hna.pdb)
 Residues : (        217)
 Resol (A): (   1.850)
   
 MATCH with RMSD   0.92 A for   9 pseudo-atoms
 LYS    98  <---> PHE   147
 LEU    99  <---> LEU   148  *
 LEU   100  <---> GLY   149
 LYS   101  <---> ASP   150
 GLY   102  <---> LYS   151
 GLU   103  <---> ILE   152
 GLY   104  <---> THR   153
 PRO   105  <---> PHE   154
 LYS   106  <---> VAL   155
   
 Target MC distance matrix
 LYS    98    0.0   3.8   6.4   8.8   9.9   9.3   9.7   9.9   7.7
 LEU    99    3.8   0.0   3.8   5.4   6.6   6.9   7.9   9.5   8.8
 LEU   100    6.4   3.8   0.0   3.8   7.0   9.0  11.0  13.1  12.5
 LYS   101    8.8   5.4   3.8   0.0   3.8   6.9   9.7  12.8  13.1
 GLY   102    9.9   6.6   7.0   3.8   0.0   3.8   7.0  10.6  11.8
 GLU   103    9.3   6.9   9.0   6.9   3.8   0.0   3.8   7.3   8.9
 GLY   104    9.7   7.9  11.0   9.7   7.0   3.8   0.0   3.8   6.5
 PRO   105    9.9   9.5  13.1  12.8  10.6   7.3   3.8   0.0   3.8
 LYS   106    7.7   8.8  12.5  13.1  11.8   8.9   6.5   3.8   0.0
   
 Hit MC distance matrix
 PHE   147    0.0   3.8   5.6   8.3   8.4   7.7   8.0   9.6   7.2
 LEU   148    3.8   0.0   3.8   6.9   6.9   7.4   7.3   9.7   8.1
 GLY   149    5.6   3.8   0.0   3.8   5.7   7.7   9.2  12.3  11.3
 ASP   150    8.3   6.9   3.8   0.0   3.8   7.0   9.7  13.2  13.1
 LYS   151    8.4   6.9   5.7   3.8   0.0   3.8   6.4  10.2  10.9
 ILE   152    7.7   7.4   7.7   7.0   3.8   0.0   3.8   7.0   8.1
 THR   153    8.0   7.3   9.2   9.7   6.4   3.8   0.0   3.8   5.6
 PHE   154    9.6   9.7  12.3  13.2  10.2   7.0   3.8   0.0   3.8
 VAL   155    7.2   8.1  11.3  13.1  10.9   8.1   5.6   3.8   0.0
   
 Nr of proteins found : ( 1)
 Nr of proteins tried : ( 580)
 Total number of hits : ( 1)
 CPU total/user/sys : 124.0 123.6 0.4
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

11 INTERFACE TO O

The O macro generated by the second example above looks as follows:

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Created by SPASM V. 950421/2.0 at Fri Apr 14 23:23:15 1995 for user gerard
! Search pattern from cra5.pdb
! LYS    98
! LEU    99
! LEU   100
! LYS   101
! GLY   102
! GLU   103
! GLY   104
! PRO   105
! LYS   106
read cra5.odb
sam_atom_in cra5.pdb CRA5 PDB
mol CRA5
pa_case atom_z 4 6 7 8 16 green cyan magenta yellow
object CRA5 ca ; end
centre_xyz   4.12  25.52  18.25
!
sketch_setup stick smooth 0.1 8
sketch_setup sphere smooth 0
db_create .cpk_radii 110 r
db_set_dat .cpk_radii ; 0.2
sketch_stick CRA5
sketch_cpk CRA5
!
! HIT 1HNA
! GLUTATHIONE S-TRANSFERASE (HUMAN, CLASS MU) (GSTM2-2) FORM A (E.C.2.5.1.18
sam_atom_in /nfs/pdb/full/1hna.pdb 1HNA PDB
! Hit nr      1
! RMSD   0.92
! LYS    98  <---> PHE   147
! LEU    99  <---> LEU   148  *
! LEU   100  <---> GLY   149
! LYS   101  <---> ASP   150
! GLY   102  <---> LYS   151
! GLU   103  <---> ILE   152
! GLY   104  <---> THR   153
! PRO   105  <---> PHE   154
! LYS   106  <---> VAL   155
mol 1HNA delete 1HNA1  ; object 1HNA1
ca ; end
lsq_obj 1HNA1_to_CRA5 1HNA1
on_off bell message Done
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The file with the LSQ operators looks as follows:

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
! Created by SPASM V. 950421/2.0 at Fri Apr 14 23:23:15 1995 for user gerard
.lsq_rt_1HNA1_to_CRA5 r 12 (6f12.6)
    0.894604   -0.413836   -0.168596   -0.296522   -0.832013    0.468859
   -0.334304   -0.369451   -0.867033   25.347286   31.042042   30.984348
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Now start O, type "@cra5.omac", go and have a cup of coffee if you have many hits, and admire the result ...

NOTE: the macro will *only* work if the PDB file names in the SPASM database file actually point to the corresponding PDB files on your local file system. To this end, your local system manager should do something like this:

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 unix> sed -e 's%/nfs/pdb/full%/your/pdb/directory%' spasm.lib > local.lib
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
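
If you want to use an operator outside O, the datablock in the operator file shown above can be read with a few lines of code. The following Python sketch is not part of SPASM; all function names are made up for illustration. It parses the header line (name, type, count, Fortran format) and the 12 numbers that follow. It ASSUMES that the first nine numbers are the rotation matrix stored column by column and that the last three are the translation, applied as x' = R*x + t. This manual does not state the storage order, so check it against the O documentation (or against a known superposition) before relying on the result; the "column_major" switch lets you try the other order.

```python
EXAMPLE = """\
! Created by SPASM V. 950421/2.0 at Fri Apr 14 23:23:15 1995 for user gerard
.lsq_rt_1HNA1_to_CRA5 r 12 (6f12.6)
    0.894604   -0.413836   -0.168596   -0.296522   -0.832013    0.468859
   -0.334304   -0.369451   -0.867033   25.347286   31.042042   30.984348
"""


def parse_datablock(text):
    """Return (name, values) for the first datablock; '!' lines are comments."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("!")]
    header = lines[0].split()          # e.g. ['.lsq_rt_...', 'r', '12', '(6f12.6)']
    name, count = header[0], int(header[2])
    values = [float(tok) for ln in lines[1:] for tok in ln.split()][:count]
    if len(values) != count:
        raise ValueError(f"expected {count} numbers, found {len(values)}")
    return name, values


def split_operator(values, column_major=True):
    """Split 12 numbers into a 3x3 rotation and a translation.

    ASSUMPTION: column-by-column storage; set column_major=False to try
    row-by-row storage instead.
    """
    r, trans = values[:9], values[9:12]
    if column_major:
        rot = [[r[col * 3 + row] for col in range(3)] for row in range(3)]
    else:
        rot = [r[0:3], r[3:6], r[6:9]]
    return rot, trans


def apply_operator(rot, trans, xyz):
    """Return R*xyz + t for a single (x, y, z) coordinate."""
    return tuple(sum(rot[i][j] * xyz[j] for j in range(3)) + trans[i]
                 for i in range(3))


if __name__ == "__main__":
    name, values = parse_datablock(EXAMPLE)
    rot, trans = split_operator(values)
    print(name, apply_operator(rot, trans, (0.0, 0.0, 0.0)))
```

Whichever storage order turns out to be right, the rows of the rotation part should be orthonormal; that is a quick sanity check on a file you have edited by hand.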

12 NON-LOCAL SIMILARITIES

By running LSQMAN with the input file produced by SPASM (optional; only available in version 2.2 and later), you may detect similarities between your model and any of the hits that extend beyond the residues of the search pattern. LSQMAN will read your complete model and then, for each of the hits, apply the operator found by SPASM and try to find more residues that are superimposed by that operator (or by an "improved" version of it). In addition, LSQMAN will produce a new O macro with the (improved) operators for the hits.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 unix> run lsqman < cra6.lsqman >& cra6_lsqman.out
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

After this, you can of course use the DEJAVU companion program DEJANA to sort the hits and remove those you're not impressed by. See the DEJAVU manual for details.

13 LOCAL PDB DIRECTORY STRUCTURE

Some labs mirror the PDB directory structure (i.e., the one where entry 1CBS goes into a subdirectory "cb", and where the file is called "pdb1cbs.ent" instead of "1cbs.pdb"). Morten Kjeldgaard contributed the following to help you generate a SPASM database with MKSPAZ.

      
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
Hiya Gerard,
   
I had a problem generating a 'spasm.lib' file because we are mirroring the
PDB directory structure, (e.g.  pdb1aak.ent is in aa/pdb1aak.ent) so I
could not do a simple sed job on spasm.lib. PDB supplies a script hash.pl
to generate a dbm database relating entry code to filename.
   
Therefore I had to re-run mkspaz with the list of 1563 pdb ident codes, but
I found that it was not straightforward to create the input file as most
entries are in different directories, and because the mkspaz program wants
extra input when the file contains an NMR structure. So I created this
little perl script to generate the mkspaz input file from a list of pdb
idents. Run it by
   
        rsdb < names
   
The script is useful for people mirroring the PDB directory structure. I
thought you might wanna include it in yer spasm manual.
   
Cheers from MOK!
   
PS: Now I have my very personalized spasm.lib file! Whooy!
   
PPS: This is my first (input file generator (input file generator)) program
;-)
   
----8<-- *snip* -----
   
#!/usr/sbin/perl
# Make an input file for mkspaz from a list
# of PDB ident codes. mok 980409.
   
# define the location of the pdb index files...
$dir = "/pdb/index";
   
print "spazzzzzm.lib\n";
dbmopen(%loc,"$dir/loc",0644);
   
while (<>) {
    $id = $_;
    chop ($id);
    $filename = $loc{$id};
   
    # only do something if the file exists...
    if ( -s $filename) {
        open(FILE, $filename);
   
        print "$filename\n";
        print "$id\n";
   
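        # check if this is an nmr structure...
        # (stop at end of file, so that a file without EXPDTA
        # or ATOM records cannot hang the loop)
        $nmr = 0;
        while (defined($_ = <FILE>)) {
            if (/^(EXPDTA|ATOM)/) {
                $nmr = /NMR/;
                last;
            }
        }
        close(FILE);
        if ($nmr) {
            print "100.0\n";
        }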
    } else {
        print STDERR "$filename does not exist...\n";
    }
}
   
--
Morten Kjeldgaard                                | e-mail: mok@imsb.au.dk
Institute of Molecular and Structural Biology    | Phone : +45 89 42 50 26
Aarhus University                                | Fax   : +45 86 20 12 22
Gustav Wieds Vej 10, DK-8000 Aarhus C, Denmark   | Home  : +45 86 18 81 80
----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
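
If you do not have the dbm index used by the script above, the file name can also be derived directly from the identifier, given the layout described at the start of this section (entry 1CBS lives in subdirectory "cb" as "pdb1cbs.ent"). The following Python sketch is only an illustration: the function names and the root directory "/pdb" are hypothetical, and the NMR test simply mirrors the logic of the Perl script (look at the first EXPDTA or ATOM record and check whether it mentions NMR).

```python
import posixpath


def mirror_path(pdb_id, root="/pdb"):
    """Path of an entry in a PDB-style mirror, e.g. 1CBS -> <root>/cb/pdb1cbs.ent.

    'root' is a placeholder; use your own mirror directory.
    """
    code = pdb_id.strip().lower()
    if len(code) != 4:
        raise ValueError(f"not a 4-character PDB identifier: {pdb_id!r}")
    # The subdirectory is named after the two middle characters of the code.
    return posixpath.join(root, code[1:3], f"pdb{code}.ent")


def is_nmr(lines):
    """True if the first EXPDTA or ATOM record mentions NMR (as in the Perl script)."""
    for line in lines:
        if line.startswith(("EXPDTA", "ATOM")):
            return "NMR" in line
    return False  # neither record found


if __name__ == "__main__":
    for ident in ("1CBS", "1aak"):
        print(ident, mirror_path(ident))
```

In a full input-file generator, you would open each file returned by mirror_path and pass its lines to is_nmr, just as the Perl script does with the file names it looks up in the index.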

14 KNOWN BUGS

None, at present.

Created at Tue Mar 2 00:20:37 1999 by MAN2HTML version 971024/1.6