Uppsala Software Factory - STRUPAT Manual

1 STRUPAT - GENERAL INFORMATION
2 REFERENCES
3 VERSION HISTORY
4 INTRODUCTION
5 INPUT TO THE PROGRAM

5.1 Start-up

5.2 Random-number seed

5.3 Random sequence

5.4 Cut-off distances and frameshifts

5.5 Minimum pattern length

5.6 Little variation

5.7 PDB file
6 OUTPUT
7 RESULTS
8 PATTERN REDUCTION
9 KNOWN BUGS
10 UNKNOWN BUGS

1 STRUPAT - GENERAL INFORMATION

Program : STRUPAT
Version : 971030
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 590, SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : generate PROSITE patterns from aligned 3D protein structures
Package : SBIN

2 REFERENCES

Reference(s) for this program:

* 1 * G.J. Kleywegt & T.A. Jones (1998). Databases in protein crystallography. Acta Cryst D54, 1119-1131. [http://alpha2.bmc.uu.se/gerard/papers/databases.html] [http://www.iucr.org/iucr-top/journals/acta/tocs/actad/1998/actad5406_1.html]

* 2 * G.J. Kleywegt & T.A. Jones (1999 ?). Chapter 25.2.6. O and associated programs. Int. Tables for Crystallography, Volume F. To be published.

3 VERSION HISTORY

970512 - 0.1 - first version
970804 - 0.5 - first documented version
970805 - 0.6 - try to extend alignments backwards as well; minor changes
971030 - 1.0 - cleaned up code and manual

4 INTRODUCTION

This program generates PROSITE patterns from a set of aligned three-dimensional protein structures in PDB format.

Suppose that you solve a new protein structure which turns out to contain a fold which is (partly) similar to that of one or more other proteins (e.g., using DEJAVU or SPASM). If you align the two structures (e.g., using LSQMAN), you can feed them into the program STRUPAT which will look for more or less conserved residues in structurally conserved regions. It will use these to generate PROSITE-type sequence patterns (a.k.a. footprints, fingerprints, motifs, ...).

Such a pattern may look as follows: G-x(3)-C-x(2)-[ILV]. This means: glycine - three residues of any type - cysteine - two residues of any type - one residue of type Ile/Leu/Val. A protein which contains the peptide GYAVCPSV would fit this pattern.
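To make the notation concrete, here is a minimal sketch (not part of STRUPAT) that translates a simple pattern of this kind into a regular expression and tests it against the peptide mentioned above. The function name prosite_to_regex is made up for this illustration, and only the syntax used in this manual is handled (single residues, bracketed alternatives, x or X, and repeat counts).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern (elements joined by '-') to a regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Split off an optional repeat count such as "(3)".
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        body, count = m.group(1), m.group(2)
        if body in ("x", "X"):
            body = "."  # any residue type
        parts.append(body + ("{%s}" % count if count else ""))
    return "".join(parts)

regex = prosite_to_regex("G-x(3)-C-x(2)-[ILV]")
print(regex)
print(bool(re.search(regex, "GYAVCPSV")))
```

The peptide GYAVCPSV matches because G is followed by three arbitrary residues (YAV), a cysteine, two arbitrary residues (PS) and a valine.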

If you want to scan PROSITE patterns ( http://www.expasy.ch/sprot/prosite.html ) against the SWISS-PROT (and TREMBL) databases, you can use the WWW-based PROSITE server ( http://www.expasy.ch/sprot/scnpsit2.html ) at ExPASy in Geneva.

5 INPUT TO THE PROGRAM

5.1 Start-up

When you start the program, it prints some information:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***
 Version - 971020/0.9
 (C) 1992-97 Gerard J. Kleywegt, Dept. Mol. Biology, BMC, Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
 Others - T.A. Jones, G. Bricogne, Rams, W.A. Hendrickson
 Others - W. Kabsch, CCP4, PROTEIN, E. Dodson, etc. etc.
 Started - Thu Oct 30 13:49:11 1997
 User - gerard
 Mode - interactive
 Host - sarek
 ProcID - 24190
 Tty - /dev/ttyq18
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***
 Reference(s) for this program:
 * 1 * G.J. Kleywegt, Uppsala University, Uppsala, Sweden, Unpublished program.
 For manuals and complete references, check: http://alpha2.bmc.uu.se/usf/
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***

 Max nr of atoms/residues : ( 100000)
 Max nr of molecules : ( 100)
 Max nr of residues in sequence : ( 1000)
 Max nr of PROSITE patterns : ( 100)
 Random sequence length : ( 2000000)
 One-letter codes : ( A R N D C E Q G H I L K M F P S T W Y V)
 Three-letter codes : ( ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.2 Random-number seed

The first bit of input is an integer seed for the random-number generator. This will be used to generate a random amino-acid sequence. If you repeat this run of the program on the same machine with the same seed, you should get identical results.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Random-number seed ? (  123456)
 Random-number seed : (  123456)
 => Random number generator initialised with seed :     123456
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.3 Random sequence

The program will now generate a random amino-acid sequence of (at present) 2,000,000 residues. This sequence has an amino-acid distribution similar to that found in proteins in the PDB (GJK, unpublished results). It will be used later to test how often generated PROSITE patterns occur in this sequence, which gives you some idea of how likely a pattern is to occur by chance. Of course, a random sequence is unlikely to be "protein-like", but if a pattern matches the random sequence more than, say, 5 or 10 times, it is unlikely to be a very discriminating one.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Generating random sequence ...
 Target composition    : (   0.081    0.044    0.046    0.058    0.019
  0.058    0.037    0.080    0.022    0.053    0.081    0.059    0.020
  0.040    0.047    0.068    0.063    0.016    0.038    0.071)
 Working ...
 Actual composition    : (   0.081    0.044    0.046    0.058    0.019
  0.057    0.037    0.080    0.022    0.053    0.081    0.060    0.020
  0.040    0.046    0.068    0.063    0.015    0.038    0.070)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
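A seeded random sequence with a prescribed composition can be sketched as follows. This is only an illustration, not the program's own generator: it uses Python's random module, and it assumes that the 20 target values are listed in the same order as the one-letter codes printed at start-up (A R N D C E Q G ...). The names random_sequence and composition are invented here.

```python
import random
from collections import Counter

# Assumption: the composition values follow the one-letter-code order
# shown in the start-up banner.
ONE_LETTER = "ARNDCEQGHILKMFPSTWYV"
TARGET = [0.081, 0.044, 0.046, 0.058, 0.019, 0.058, 0.037, 0.080, 0.022, 0.053,
          0.081, 0.059, 0.020, 0.040, 0.047, 0.068, 0.063, 0.016, 0.038, 0.071]

def random_sequence(length, seed):
    """Draw residues independently, weighted by the target composition."""
    rng = random.Random(seed)  # private generator, reproducible per seed
    return "".join(rng.choices(ONE_LETTER, weights=TARGET, k=length))

def composition(seq):
    """Fraction of each residue type, in ONE_LETTER order."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in ONE_LETTER]

seq = random_sequence(200_000, seed=123456)
for aa, target, actual in zip(ONE_LETTER, TARGET, composition(seq)):
    print(aa, "%.3f %.3f" % (target, actual))
```

As in the program output, the actual composition differs slightly from the target, and the same seed gives the same sequence (here only within one Python version, since the generator is Python's, not STRUPAT's).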

5.4 Cut-off distances and frameshifts

You are to provide a cut-off distance (in Å) for CA atoms of different molecules to be considered equivalent. If this number is very high, frameshifts may occur in the structural alignments, although the program can be instructed to try and correct for these. Another cut-off distance determines how bits of equivalent structure are extended at their ends.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Equivalent CA distance ? (   3.500) 5
 Equivalent CA distance : (   5.000)
   
 Extension CA distance ? (   6.000) 8
 Extension CA distance : (   8.000)
   
 Try to correct frame-shifts (Y/N) ? (Y)
 Try to correct frame-shifts (Y/N) : (Y)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.5 Minimum pattern length

Very short patterns are unlikely to be very specific. Also, for calculating RMSDs between aligned stretches, at least 3 CA atoms are required.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Min pattern length ? (      10) 6
 Min pattern length : (       6)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.6 Little variation

If only 2, 3 or 4 different residue types occur in a certain position of all structures/sequences, this can be included in the pattern. But this only makes sense if you have a reasonable number of structures. For instance, if you only have three structures, and you observe residue types Arg, Lys, and Gln in a certain position, you probably would not want to conclude that this residue is always Arg, Lys or Gln. However, if you have 30 aligned structures, you might.

There are a few exceptions to this, namely if the various observed residue types are similar, such as: D/E, R/K, F/Y, F/Y/W, N/Q, S/T, A/G, and I/L/V.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If only 2, 3, or 4 different residue types
 occur in at least NMIN2, NMIN3, NMIN4 of
 your sequences, an entry will be generated
 (e.g., [SE], [TGW], [KILM]).  By setting
 NMIN2/3/4 greater than the number of sequences
 you can prevent that such entries are used.
 Value for MIN2 (>2) ? (       6)
 Value for MIN3 (>3) ? (      15)
 Value for MIN4 (>4) ? (     100)
 Value for MIN2 : (       6)
 Value for MIN3 : (      15)
 Value for MIN4 : (     100)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.7 PDB file

Provide the name of the PDB file which contains ALL molecules. Note that the molecules must have been superimposed previously (e.g., with O or LSQMAN; LSQMAN contains a BRute_force command to find structural alignments "ab initio"). Any two consecutive molecules in the file must have different chain identifiers. However, not all identifiers have to be unique (which would otherwise limit you to a maximum of 26 molecules), e.g. you could alternate chain identifiers A and B. Note that the program *ONLY* reads the CA atoms, so you can make your files considerably smaller by only including these (e.g.: grep ^ATOM myfile.pdb | grep ' CA ' > new.pdb).

The example below is for a PDB file which contains a number of superimposed glutathione S-transferase structures.

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Name of PDB file ? (aligned.pdb) aligned.pdb
 Name of PDB file : (aligned.pdb)
 Nr of CA atoms : ( 1712)
 Nr of molecules : ( 8)

 Mol # 1 Atoms 1 to 221
 Mol # 2 Atoms 222 to 430
 Mol # 3 Atoms 431 to 651
 Mol # 4 Atoms 652 to 858
 Mol # 5 Atoms 859 to 1076
 Mol # 6 Atoms 1077 to 1293
 Mol # 7 Atoms 1294 to 1495
 Mol # 8 Atoms 1496 to 1712
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
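The way such a file falls apart into molecules can be sketched as below. This is a guess at the bookkeeping, not the program's source: it keeps only ATOM records whose atom name is CA (standard PDB columns) and starts a new molecule whenever the chain identifier changes, which is why alternating A and B works. The function name read_ca_molecules is hypothetical.

```python
def read_ca_molecules(lines):
    """Group CA atoms into molecules; a change of chain ID starts a new one."""
    molecules = []
    previous_chain = None
    for line in lines:
        # Atom name is in columns 13-16; skip everything that is not a CA ATOM.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain = line[21]  # chain identifier, column 22
        if chain != previous_chain:
            molecules.append([])
            previous_chain = chain
        residue = (line[17:20], int(line[22:26]))  # residue name and number
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        molecules[-1].append((residue, xyz))
    return molecules
```

A typical call would be read_ca_molecules(open("aligned.pdb")), after which len(molecules) and the length of each entry correspond to the "Nr of molecules" and the atom ranges listed by the program.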

6 OUTPUT

STRUPAT will now start looking for residues that are structurally equivalent in all aligned structures (i.e., a residue in the first protein has a partner in each of the other structures within the cut-off distance). When it encounters such a residue, it checks to see if neighbouring residues (on either side) also have partners in all the other structures (now using the second distance cut-off).
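The search described above can be sketched roughly as follows. This is one possible reading of the procedure, not the actual implementation: molecules are plain lists of CA coordinates, a partner is taken to be the nearest CA within the first cut-off, and the extension walks along the chain one residue at a time while the correspondingly shifted partners stay within the second cut-off. The names nearest_within, seed_residues and extend are made up.

```python
import math

def nearest_within(point, coords, cutoff):
    """Index of the closest CA in coords that lies within cutoff, else None."""
    best, best_d = None, cutoff
    for i, c in enumerate(coords):
        d = math.dist(point, c)
        if d <= best_d:
            best, best_d = i, d
    return best

def seed_residues(molecules, cutoff=3.5):
    """Residues of molecule 1 that have a partner in every other molecule."""
    first, others = molecules[0], molecules[1:]
    seeds = {}
    for i, ca in enumerate(first):
        partners = [nearest_within(ca, other, cutoff) for other in others]
        if all(p is not None for p in partners):
            seeds[i] = partners
    return seeds

def extend(molecules, i, partners, ext_cutoff=6.0, step=1):
    """Offset reached when walking from residue i in direction step (+1/-1)."""
    first, others = molecules[0], molecules[1:]
    k = 0
    while 0 <= i + k + step < len(first):
        nxt = k + step
        ok = all(0 <= p + nxt < len(o)
                 and math.dist(first[i + nxt], o[p + nxt]) <= ext_cutoff
                 for p, o in zip(partners, others))
        if not ok:
            break
        k = nxt
    return k
```

The defaults 3.5 and 6.0 are the program's default cut-offs from section 5.4; everything else is an assumption of this sketch.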

In this way, a set of residues is equivalenced between all structures. However, the structural superposition may not always be optimal, so the program will try to detect and fix any frameshift errors. It does this simply by checking for each structure if shifting the alignment to the first structure by one residue forward or backward would improve the superpositioning RMSD. If so, the equivalenced residues are altered accordingly, and the frameshift test is carried out again, until no more frameshifts occur.

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ----------------------------------------------------------------------
 Shift mol 2 by -1 (RMSD -1/0/+1 : 4.3 6.4 8.9 A)
 Shift mol 3 by -1 (RMSD -1/0/+1 : 1.8 3.7 6.3 A)
 Shift mol 4 by -1 (RMSD -1/0/+1 : 4.2 6.3 8.8 A)
 Shift mol 7 by -1 (RMSD -1/0/+1 : 3.9 6.0 8.3 A)

 ----------------------------------------------------------------------
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
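The frameshift test can be illustrated with a small sketch. It assumes (this is not confirmed by the manual) that the RMSD is computed directly from the CA coordinates as they are in the file, without re-fitting, for the current alignment and for the alignment shifted by one residue either way. The function names rmsd and best_shift are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equally long coordinate lists."""
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def best_shift(first, other, start_first, start_other, length):
    """Return (shift, rmsds): the shift in (-1, 0, +1) with the lowest RMSD."""
    ref = first[start_first:start_first + length]
    rmsds = {}
    for shift in (-1, 0, 1):
        lo = start_other + shift
        if lo < 0 or lo + length > len(other):
            continue  # shifted stretch would run off the chain
        rmsds[shift] = rmsd(ref, other[lo:lo + length])
    return min(rmsds, key=rmsds.get), rmsds

# Toy chain: 'other' is 'first' with one extra residue in front, and the
# current alignment is deliberately off by one residue.
first = [(3.8 * i, 0.0, 0.0) for i in range(6)]
other = [(-3.8, 0.0, 0.0)] + first
shift, values = best_shift(first, other, start_first=1, start_other=3, length=3)
print(shift, values)
```

If the best shift is not 0, the equivalences for that molecule would be moved accordingly and the test repeated until no molecule shifts any more, as described above.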

At that stage, the program will again try to extend the alignments in both directions using the extension distance cut-off. If the resulting conserved set of residues contains at least the minimum number of residues defined by the user, a potential pattern has been found.

For every (potential) pattern that the program discovers, the output includes:

- a listing of the first residue of the stretch of structurally conserved residues in every molecule

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 New structurally conserved stretch !
 Starts at residue TYR -   49
   molecule    2 @ THR -   46
   molecule    3 @ TYR -   49
   molecule    4 @ SER -   44
   molecule    5 @ GLY -   49
   molecule    6 @ GLY -   53
   molecule    7 @ THR -   44
   molecule    8 @ GLY -   53
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- for every residue, the amino-acid type in each molecule, and the program's "reduction" of it in terms of PROSITE pattern elements. For instance, a strictly conserved glycine will be "reduced" to "G", whereas "|YFFY|" would yield "[FY]".

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 |YTYSGGTG| ==>  X -
 |LCLCLLML| ==>  X -
 |MLMLEDYD| ==>  X -
 |FYFFFFSF| ==>  X -
 |QGQRPPNP| ==>  X -
 |QQQQNNAN| ==>  X -
 |VLVLLLML| ==>  X -
 |PPPPPPPP| ==>  P -
 |MKMKYYVY| ==>  X -
 |VFVFYLLL| ==>  X -
 |EEEQIIDI| ==>  X -
 |IDIDDDID| ==>  [ID] -
 |DGDGGGDG| ==>  [DG] -
 |GDGDDTGS| ==>  X -
 |MLMLVHTR| ==>  X -
 |KTKTKKKK| ==>  [KT] -
 |LLLLLIMI| ==>  X -
 |VYVYTTST| ==>  X -
 |QQQQQQQQ| ==>  Q -
 |TSTSSSSS| ==>  [ST] -
 |RNRNMNMN| ==>  X -
 |AAAAAACA| ==>  [AC] -
 |IIIIIIII| ==>  I -
 |LLLLILAM| ==>  X -
 |NRNRRRRR| ==>  [NR] -
 |YHYHYYHY| ==>  [YH] -
 |ILILIILL| ==>  [IL] -
 |AGAGAAAA| ==>  [AG] -
 |SRSRDRRR| ==>  X -
 |KSKSKKEK| ==>  X -
 |YLYFHHFH| ==>  X -
 |NGNGNNGH| ==>  X -
 |LLLLMLLL| ==>  [LM] -
 |YYYYLCDC| ==>  X -
 |GGGGGGGG| ==>  G -
 |KKKKGEKE| ==>  X -
 |DNDDCSTT| ==>  X -
 |IQIQPESE| ==>  X -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- some information about the aligned set of residues, namely their number, and the RMS (RMSD) value (in Å). This is calculated from all Nmol*(Nmol-1)/2 possible pair-wise superpositionings of this stretch of residues.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of residues  : (      38)
 RMS (RMSD) (A)  : (   1.545)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- now the program goes to work and "reduces" the partial PROSITE patterns, by collecting sequential "X"s and stripping any "X"s from the start and end

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 PROSITE pattern : (P - X(3) - [ID] - [DG] - X(2) - [KT] - X(2) - Q - [ST]
  - X - [AC] - I - X - [NR] - [YH] - [IL] - [AG] - X(4) - [LM] - X - G)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- the program calculates two scores to help you judge the value of the pattern.

"Score 1" is calculated as SUM 10LOG(Ntotal/Nposs), where the sum extends over all residues in the pattern, Ntotal is the number of aligned structures (but never more than 20, the number of different residue types), and Nposs is the number of different residue types that occur in that position. Only positions that resulted in a non-"X" partial pattern contribute to the sum. E.g., if there are four different residue types for four sequences, (Ntotal/Nposs) will be 1, and the contribution to the sum of logs will be zero. If there are only two possible residue types observed in 30 different structures, the contribution will be 10LOG(20/2), since the maximum number of possible different residue types is 20. The higher the total sum, the more specific information the pattern contains. Usually, this is strongly correlated to the length of the pattern.

"Score 2" is an integer number calculated as a sum over all residues of the pattern of a subjective score of the quality of the pattern element. The subjective score varies from 0 (for an "X" entry) to 10 (for a strictly conserved residue type).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Score 1         : (  21.775)
 Score 2         : (      77)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- finally, the program will check the random sequence it prepared earlier to see how often the pattern occurs in it. If it occurs more than a few times, the pattern is probably not suitable for searching against a database, since it is likely to result in many false positives.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of matches to random sequence : (          0)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
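Two of the steps in the list above, collapsing the "X" entries and computing "Score 1", are easy to sketch. The code below is an illustration only, with invented names (assemble_pattern, score1). The base of the logarithm is left as a parameter: "10LOG" is read here as a base-10 logarithm, but the 2.8 printed in the summary in section 7 for a pattern with two two-residue elements in 8 structures is what a natural logarithm would give, so treat the base as uncertain.

```python
import math

def assemble_pattern(elements):
    """Strip leading/trailing X entries and merge runs of X into X(n)."""
    items = list(elements)
    while items and items[0] == "X":
        items.pop(0)
    while items and items[-1] == "X":
        items.pop()
    out, run = [], 0
    for item in items:
        if item == "X":
            run += 1
            continue
        if run:  # flush the run of X entries seen so far
            out.append("X" if run == 1 else "X(%d)" % run)
            run = 0
        out.append(item)
    return "-".join(out)

def score1(n_structures, n_types_per_element, base=10.0):
    """Sum of log(Ntotal/Nposs) over the non-X elements of a pattern."""
    n_total = min(n_structures, 20)  # at most 20 different residue types
    return sum(math.log(n_total / n, base) for n in n_types_per_element)

print(assemble_pattern(["X", "P", "X", "X", "X", "[ID]", "[DG]", "X"]))
print(score1(30, [2]))         # one two-type element, 30 structures
print(score1(4, [4]))          # four types in four sequences contribute zero
```

Here n_types_per_element holds Nposs for every non-"X" element, e.g. [1, 2, 2] for P, [ID], [DG].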

7 RESULTS

When the program has finished, it will print a summary of the PROSITE patterns of sufficient length.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of PROSITE patterns found : (       4)
   
    # Leng  Score  #Random  PROSITE pattern
    1   28   21.8        0  P-X(3)-[ID]-[DG]-X(2)-[KT]-X(2)-Q-[ST]-X-[AC]-I-X-
                            [NR]-[YH]-[IL]-[AG]-X(4)-[LM]-X-G
    2    6    2.8    28128  [EG]-X(4)-[DR]
    3    8    2.8    33922  [DP]-X(6)-[LA]
    4   21   11.7        0  [FC]-P-X-[IL]-X(5)-R-X(6)-[IV]-X-[KA]-[FY]-[LM]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
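The "#Random" column can be checked with a back-of-the-envelope estimate. The sketch below is not what the program does (STRUPAT counts real matches in its random sequence); it multiplies the target frequencies of the allowed residue types at each non-X position and scales by the number of positions in a 2,000,000-residue sequence. It assumes independent positions and that the composition values of section 5.3 are in the one-letter-code order of the start-up banner. The name expected_random_matches is made up.

```python
import re

# Assumption: composition values follow the banner's one-letter-code order.
ONE_LETTER = "ARNDCEQGHILKMFPSTWYV"
TARGET = [0.081, 0.044, 0.046, 0.058, 0.019, 0.058, 0.037, 0.080, 0.022, 0.053,
          0.081, 0.059, 0.020, 0.040, 0.047, 0.068, 0.063, 0.016, 0.038, 0.071]
FREQ = dict(zip(ONE_LETTER, TARGET))

def expected_random_matches(pattern, seq_length=2_000_000):
    """Expected number of start positions at which the pattern matches."""
    prob, width = 1.0, 0
    for element in pattern.split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)\))?", element.strip())
        body, repeat = m.group(1), int(m.group(2) or 1)
        width += repeat
        if body.upper() == "X":
            continue  # any residue: probability 1
        p = sum(FREQ[aa] for aa in body.strip("[]"))
        prob *= p ** repeat
    return prob * (seq_length - width + 1)

for pattern in ("[EG]-X(4)-[DR]", "[DP]-X(6)-[LA]"):
    print(pattern, round(expected_random_matches(pattern)))
```

For [EG]-X(4)-[DR] this gives about 28,150 and for [DP]-X(6)-[LA] about 34,000, in the same range as the 28128 and 33922 matches reported above, which shows why such short patterns are of little use.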

Scanning SWISS-PROT with the first pattern (using the PROSITE WWW server) yields (as of 971030) 45 hits, all of which are glutathione S-transferase sequences. Pattern number 4 yields only 38 hits, but all of these are GST sequences as well. However, a search of SWISS-PROT revealed that there were 122 GST sequences in the database, so each pattern retrieves only about a third of them.

We can run the program with different parameters:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Equivalent CA distance ? (   3.500)
 Equivalent CA distance : (   3.500)
   
 Extension CA distance ? (   6.000) 3.5
 Extension CA distance : (   3.500)
   
 Try to correct frame-shifts (Y/N) ? (Y)
 Try to correct frame-shifts (Y/N) : (Y)
   
 Min pattern length ? (      10) 5
 Min pattern length : (       5)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In that case, the following patterns are found:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of PROSITE patterns found : (       3)
   
    # Leng  Score  #Random  PROSITE pattern
    1   18   12.6        0  [IL]-X-Y-[FW]-X(3)-G-X(5)-R-X-[LV]-L-[AE]
    2    6    4.9     1461  P-X(3)-[ID]-[DG]
    3   18   14.8        0  [KT]-X(2)-Q-[ST]-X-[AC]-I-X-[NR]-[YH]-[IL]-[AG]-
                            X(4)-[LM]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The first pattern gives 42 matches (all GST sequences), and the third 45 (all GST).

8 PATTERN REDUCTION

The program uses a simple algorithm to reduce a string of residue types to a PROSITE sub-pattern. At present (971030), there are 12 possible cases:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 CASE #     "Score 2"   Description
 ------     ---------   ---------------------------------------------------------
    1          10       absolutely conserved residue type
    2           8       (potentially) negatively charged residue [DE]
    3           8       (potentially) positively charged residue [RK]
    4           8       only Phe and Tyr occur
    5           6       only Phe and Tyr and Trp occur
    6           5       only Asn and Gln occur
    7           5       only Ser and Thr occur
    8           5       only Ala and Gly occur
    9           6       only Ile and Leu and Val occur
   10           3       only 2 different types occur and at least NMIN2 sequences
   11           2       only 3 different types occur and at least NMIN3 sequences
   12           1       only 4 different types occur and at least NMIN4 sequences
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
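The table above can be turned into a small sketch of the reduction step. This is a reconstruction from the table and the sample output in section 6, not the program's code, and it makes a few assumptions: a subset of a three-member group (e.g. only Ile and Leu) is treated as that group and scores accordingly, which is consistent with the "Score 2" of 77 printed for the sample pattern; members of a "similar" group are written in the group's order, other types in order of first appearance; and "at least NMIN sequences" refers to the total number of aligned sequences. The name reduce_column is invented.

```python
GROUPS = [  # (members, "Score 2") for cases 2-9, in table order
    ("DE", 8), ("RK", 8), ("FY", 8), ("FYW", 6),
    ("NQ", 5), ("ST", 5), ("AG", 5), ("ILV", 6),
]

def reduce_column(column, nmin2=6, nmin3=15, nmin4=100):
    """Return (pattern element, "Score 2" contribution) for one position."""
    types = list(dict.fromkeys(column))  # distinct types, first-seen order
    if len(types) == 1:
        return types[0], 10                              # case 1
    for members, score in GROUPS:                        # cases 2-9
        if set(types) <= set(members):
            ordered = "".join(aa for aa in members if aa in types)
            return "[%s]" % ordered, score
    nmin = {2: nmin2, 3: nmin3, 4: nmin4}.get(len(types))
    if nmin is not None and len(column) >= nmin:         # cases 10-12
        return "[%s]" % "".join(types), 5 - len(types)   # scores 3, 2, 1
    return "X", 0

for column in ("PPPPPPPP", "TSTSSSSS", "IDIDDDID", "LLLLLIMI"):
    print(column, reduce_column(column))
```

With the default NMIN values of section 5.6 and 8 sequences, a column with three unrelated types such as LLLLLIMI is reduced to "X", because 8 is less than NMIN3 (15).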

9 KNOWN BUGS

None, at present ("peppar, peppar").

10 UNKNOWN BUGS

Does not compute.

Created at Fri Dec 18 19:42:29 1998 by MAN2HTML version 971024/1.6