USF

NEWS FROM THE UPPSALA SOFTWARE FACTORY - 4

A super position

Gerard J. Kleywegt & T. Alwyn Jones
Department of Molecular Biology
Biomedical Centre, Uppsala University
Uppsala - Sweden

A very common task for any protein crystallographer is the superpositioning of macromolecules which are more or less identical (NCS-related molecules, identical structures solved in different labs, in different spacegroups or with different methods, mutants or complexes of a certain protein, etc.) or which display structural similarities. If two molecules have ~90 % or more sequence identity, an explicit superpositioning of the two structures can be carried out by specifying which atoms are to be matched in the two molecules, and calculating the rotation/translation operator which minimises the sum of the squares of the distances between corresponding atoms. If the sequences are more distantly related, or even unrelated, the problem becomes less trivial. In particular, the definition of similarity becomes ambiguous, and many papers have been written about methods to resolve this.
We have written a program called LSQMAN which can be used to obtain "optimal" structural alignments of structures with any level of sequence homology, where the definition of "optimal" can be largely controlled by the user. The program was originally written in order to quickly sort out "good" from "bad" hits found by DEJAVU [1], our program to detect folding similarities, before analysing them in detail on the display using O [2, 3]. Nevertheless, the program can be used independently from both DEJAVU and O, and it can be used just as easily with proteins, nucleic acids and other molecules.
The simplest task, superimposing molecules given two sets of atoms which should be matched, is easily accomplished [4, 5]. The implementation is very similar to that used in O, except that the atom types which should be used are freely definable. This means that one may use, for instance, only Cα atoms, backbone atoms, all (non-hydrogen) atoms, or a set of user-defined atom types (e.g., in the case of nucleic acids or small molecules). An example:

LSQMAN > ex m1 a1-999 m1 b1
WARNING - mol1 == mol2 !
Explicit fit of M1 A1-999
And M1 B1
Atom types |NONH|
Nr of atoms to match : ( 3499)
The 3499 atoms have an RMS distance of 2.311 A
RMS delta B = 7.802 A2
Corr. coeff. = 0.9031
Rotation : 0.382393 -0.058393 0.922153
-0.033219 -0.998225 -0.049435
0.923402 -0.011729 -0.383654
Translation : 5.715 16.617 -8.061

Note that, apart from the RMS distance of the atoms after superpositioning, the RMS ΔB and the linear correlation coefficient of the temperature factors of the matched atoms are calculated as well. In the case of NCS-related molecules, and that of very similar molecules, one would expect RMS ΔB to be of the order of ~3-5 Å², and the correlation coefficient to be greater than ~0.95.
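As an illustration of the quantities reported above, the following is a minimal sketch of an explicit least-squares fit and of the two temperature-factor statistics. It uses the widely used SVD formulation of the least-squares superposition and assumes NumPy is available; it is not LSQMAN's own code, and the function names `explicit_fit` and `b_factor_stats` are hypothetical.

```python
import numpy as np

def explicit_fit(fixed, moving):
    """Least-squares superposition of `moving` onto `fixed` (both N x 3).

    Returns (R, t, rmsd) such that moving @ R.T + t best matches fixed.
    """
    fixed = np.asarray(fixed, dtype=float)
    moving = np.asarray(moving, dtype=float)
    c_fixed = fixed.mean(axis=0)
    c_moving = moving.mean(axis=0)
    # 3 x 3 covariance matrix of the centred coordinate sets.
    H = (moving - c_moving).T @ (fixed - c_fixed)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_fixed - R @ c_moving
    diff = moving @ R.T + t - fixed
    rmsd = float(np.sqrt((diff ** 2).sum(axis=1).mean()))
    return R, t, rmsd

def b_factor_stats(b1, b2):
    """RMS difference and linear correlation of matched temperature factors."""
    b1 = np.asarray(b1, dtype=float)
    b2 = np.asarray(b2, dtype=float)
    rms_delta_b = float(np.sqrt(np.mean((b1 - b2) ** 2)))
    corr = float(np.corrcoef(b1, b2)[0, 1])
    return rms_delta_b, corr
```

Here the two coordinate arrays must list the matched atoms in the same order; selecting them (by chain, residue range and atom type) is left to the caller.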
Note that LSQMAN cannot automatically detect the optimal alignment of two molecules as some other programs do [6]. Usually, sets of matching atoms are either trivial to define (e.g., NCS-related molecules, mutants, complexes), or non-trivial. In the latter case, we use DEJAVU [1] first to carry out a rough alignment of the secondary-structure elements of the protein of interest and all other proteins in the PDB that appear to show structural similarities. The rough alignments are then improved with LSQMAN.

* IMPROVING OPERATORS
Optimal alignment of structures with low sequence homology is somewhat arbitrary, since "optimal" involves both the number of structurally equivalent residues and their RMS distance after alignment. LSQMAN uses an operator-improvement algorithm similar to that employed by O [2, 3]: using an initial operator, the program locates consecutive fragments of residues (using their Cα atoms, for example) whose length exceeds a certain minimum number of residues, and whose atoms lie within a certain cut-off distance of the corresponding atoms in the other molecule. These fragments are used to calculate a new, explicit operator, and the process is iterated until it converges. Note that this algorithm is insensitive to sequence gaps, so that it can be used both to find the best-conserved fragments in similar molecules and to find the common core of two completely different molecules. The implementation in LSQMAN contains some extra "embellishments":
* a sequentiality constraint (optional). If two proteins have a common motif with the same topology, this is a useful constraint; on the other hand, if two structures contain similar arrangements of helices and strands, but in a different order in their sequences, this constraint would be switched off.
* the two cut-offs (minimum number of consecutive residues in matched fragments, and maximum distance between equivalenced atoms) can either be kept fixed, or allowed to "decay". For example, one could start with a distance cut-off of 4 Å to get the overall operator relating the two molecules, and then multiply this cut-off by a factor of 0.95 in every iteration to "zoom in" on the structurally most similar core fragments of the two.
* the optimisation criterion can be selected by the user. At present, LSQMAN can optimise: (1) the number of matched residues (maximise); (2) the RMS distance of the matched residues (minimise); (3) the Similarity Index (SI; minimise); or (4) the Match Index (MI; maximise). The Similarity Index is defined as:

                   RMSD * min(N1,N2)
            SI = ---------------------
                         Nm

where N1 and N2 are the numbers of residues in molecules 1 and 2, Nm is the number of matched residues, and RMSD is their RMS distance. SI assumes values >= 0.0 Å; the lower the value of SI, the better the fit and the more similar the two molecules are. The Match Index is defined as:

                                (1 + Nm)
            MI = --------------------------------------
                   (1 + W * RMSD ) * (1 + min(N1,N2))

where W is a positive weight (the higher the weight, the bigger the influence of the RMSD on the value of MI; suggested values for W are between 0.1 and 1). MI assumes values between 0 and 1, where "0" indicates a "perfect mismatch" and "1" a perfect match.
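The two indices are simple to evaluate by hand or in a few lines of code. The sketch below transcribes the SI and MI definitions directly; the function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def similarity_index(rmsd, n_matched, n1, n2):
    """SI = RMSD * min(N1, N2) / Nm; lower values mean a better fit."""
    if n_matched <= 0:
        raise ValueError("SI needs at least one matched residue")
    return rmsd * min(n1, n2) / n_matched

def match_index(rmsd, n_matched, n1, n2, weight):
    """MI = (1 + Nm) / ((1 + W * RMSD) * (1 + min(N1, N2)))."""
    if weight <= 0:
        raise ValueError("the weight W must be positive")
    return (1 + n_matched) / ((1 + weight * rmsd) * (1 + min(n1, n2)))
```

For a hypothetical pair of molecules with 100 and 120 residues, 50 matched residues and an RMSD of 1.0 Å, this gives SI = 2.0 Å and, with W = 1, MI = 51 / 202, i.e. about 0.25. A perfect match (all residues of the smaller molecule matched at an RMSD of 0) gives SI = 0 and MI = 1.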
After the operator improvement has converged (or a maximum number of cycles has been carried out), the structure-based sequence alignment is printed. The matched residues are shown, along with the distance between the atoms that were used (usually, Cα atoms). If two residues are of the same type, an asterisk is printed as well. Also, some statistics pertaining to the number and percentage of matched and conserved residues are printed. An example:

Found fragment of length : ( 53)
Found fragment of length : ( 260)
Found fragment of length : ( 57)
Found fragment of length : ( 59)

Cycle : ( 10)
Distance cut-off (A) : ( 3.800)
Min fragment length (res) : ( 5)
The 428 atoms have an RMS distance of 0.946 A
SI = RMS * Nmin / Nmatch = 1.01260
MI = (1+Nmatch)/(1+W*RMS)*(1+Nmin) = 0.48022
RMS delta B for matched atoms = 7.610 A2
Corr. coefficient matched atom Bs = 0.908
Rotation : 0.38169697 -0.06605943 0.92192382
-0.04122496 -0.99766684 -0.05441866
0.92336768 -0.01723484 -0.38352972
Translation : 5.7764 17.2442 -8.0352

Fragment SER-A 4 <===> SER-B 4 @ 2.43 A *
SER-A 5 <===> SER-B 5 @ 1.11 A *
ARG-A 6 <===> ARG-B 6 @ 1.19 A *
TYR-A 7 <===> TYR-B 7 @ 0.40 A *
VAL-A 8 <===> VAL-B 8 @ 0.49 A *
ASN-A 9 <===> ASN-B 9 @ 0.21 A *
LEU-A 10 <===> LEU-B 10 @ 0.90 A *
[...]
GLY-A 456 <===> GLY-B 456 @ 3.40 A *
VAL-A 457 <===> VAL-B 457 @ 3.68 A *

Nr of residues in mol1 : ( 459)
Nr of residues in mol2 : ( 458)
Nr of matched residues : ( 428)
Nr of identical residues : ( 428)
% identical of matched : ( 100.000)
% matched of mol1 : ( 93.246)
% identical of mol1 : ( 93.246)
% matched of mol2 : ( 93.450)
% identical of mol2 : ( 93.450)
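The improvement cycle described in this section can be sketched in a few lines. The version below is an illustration only, not LSQMAN's implementation: it assumes NumPy, works on one atom per residue, seeds each fragment from the nearest atom in the other molecule (an assumption of this sketch), and leaves out the sequentiality constraint and the choice of optimisation criterion. All names and default values are hypothetical.

```python
import numpy as np

def fit(fixed, moving):
    """Least-squares operator (R, t) mapping `moving` onto `fixed`."""
    cf, cm = fixed.mean(axis=0), moving.mean(axis=0)
    U, _, Vt = np.linalg.svd((moving - cm).T @ (fixed - cf))
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cf - R @ cm

def matched_fragments(fixed, moved, cutoff, min_len):
    """Pairs (i, j) from runs of at least `min_len` consecutive residues
    in which every atom pair is closer than `cutoff`."""
    dist = np.linalg.norm(fixed[:, None, :] - moved[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    pairs, used = [], set()
    i = 0
    while i < len(fixed):
        j = int(nearest[i])
        run = []
        # Extend the run while residue i+k stays close to residue j+k.
        while (i + len(run) < len(fixed) and j + len(run) < len(moved)
               and (j + len(run)) not in used
               and dist[i + len(run), j + len(run)] < cutoff):
            run.append((i + len(run), j + len(run)))
        if len(run) >= min_len:
            pairs.extend(run)
            used.update(q for _, q in run)
            i += len(run)
        else:
            i += 1
    return pairs

def improve_operator(fixed, moving, R, t, cutoff=4.0, min_len=5,
                     decay=1.0, max_cycles=20):
    """Iterate: apply operator, collect fragments, refit, shrink cut-off."""
    pairs = []
    for _ in range(max_cycles):
        moved = moving @ R.T + t
        new_pairs = matched_fragments(fixed, moved, cutoff, min_len)
        if len(new_pairs) < 3 or new_pairs == pairs:
            break  # too few atoms to fit, or converged
        pairs = new_pairs
        R, t = fit(fixed[[i for i, _ in pairs]], moving[[j for _, j in pairs]])
        cutoff *= decay  # decay < 1 "zooms in" on the best-fitting core
    return R, t, pairs
```

With `decay=1.0` the cut-off stays fixed; a value such as 0.95 reproduces the decaying cut-off mentioned in the list of embellishments. Both `fixed` and `moving` are expected to be N x 3 NumPy arrays.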

Statistics can be obtained with the SHow_operator command:

The 428 atoms have an RMS distance of 0.946 A
SI = RMS * Nmin / Nmatch = 1.01260
MI = (1+Nmatch)/(1+W*RMS)*(1+Nmin) = 0.48022
RMS delta B for matched atoms = 7.610 A2
Corr. coefficient matched atom Bs = 0.908
[...]
NCSOP 1 = 0.3816970 -0.0412250 0.9233677 5.776
-0.0660594 -0.9976668 -0.0172348 17.244
0.9219238 -0.0544187 -0.3835297 -8.035
Determinant of rotation matrix = 1.000000

Crowther Alpha Beta Gamma 178.93069 -112.55250 3.37809
Spherical polars Omega Phi Chi 123.71790 177.77631 178.71825
Direction cosines of rotation axis -0.83114 0.03227 -0.55510
Dave Smith -2.57299 -22.79103 -173.83571
Rotation angle = 178.718246

*POOR* - NCS not restrained
*POOR* - NCS Bs not restrained
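The rotation angle and the direction cosines of the rotation axis in such a listing can be checked against the rotation matrix itself. The sketch below uses the standard relations trace(R) = 1 + 2 cos(angle) and |antisymmetric part| = 2 sin(angle); the function name is hypothetical, and the other angle conventions in the listing (Crowther, spherical polars, Dave Smith) are not covered.

```python
import math

def rotation_angle_and_axis(R):
    """Return (angle in degrees, unit axis) for a proper 3x3 rotation matrix.

    The axis is taken from the antisymmetric part of R, so it is undefined
    for rotations of exactly 0 or 180 degrees.
    """
    v = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    norm = math.sqrt(sum(c * c for c in v))   # equals 2 * sin(angle)
    trace = R[0][0] + R[1][1] + R[2][2]       # equals 1 + 2 * cos(angle)
    if norm == 0.0:
        raise ValueError("rotation axis is undefined for this matrix")
    angle = math.degrees(math.atan2(norm, trace - 1.0))
    axis = tuple(c / norm for c in v)
    return angle, axis

# Rotation matrix as printed in the operator-improvement example above.
R = [[0.38169697, -0.06605943, 0.92192382],
     [-0.04122496, -0.99766684, -0.05441866],
     [0.92336768, -0.01723484, -0.38352972]]
angle, axis = rotation_angle_and_axis(R)
print(f"{angle:.3f}", [f"{c:.4f}" for c in axis])
```

For this matrix the angle comes out at about 178.72 degrees, in line with the listed rotation angle. The axis agrees with the listed direction cosines up to an overall sign; the sign flips when the transposed matrix is used, since the transpose describes the inverse rotation.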

* OTHER FEATURES
After operator improvement, for example using Cα atoms, the RMS distance of any set of atoms can be calculated with the RMsd_calc command. Operators can be stored as, or read from, O datablock files; they can be edited, and they can be applied to a molecule, for example for display purposes. In addition, an O macro can be generated automatically which will read the appropriate PDB files, apply the current operator(s), and display the Cα traces of the superimposed molecules.
LSQMAN was originally written as a fast filter between DEJAVU and O. DEJAVU looks for proteins in the PDB which appear to display structural similarities to another protein [1]. However, many false hits are often found, especially if only weak similarities are present (in which case one has to use very relaxed search criteria). DEJAVU produces an O macro to carry out the structural alignment and to display the hits, but this takes quite some time to execute. Using LSQMAN in between, one has a very quick means of separating "the men from the boys". Therefore, DEJAVU can be instructed to produce an input file for LSQMAN to carry out the operator improvement stage for all hits. LSQMAN, in turn, will produce an O macro to display the hits superimposed on the search structure using the automatically improved operators. This macro can be edited to remove all false hits (recognised by few matched residues and/or a very large RMSD).
Finally, an interesting way of analysing differences between similar molecules is provided by the option to produce a DPHI/DPSI plot (essentially, a "difference Balasubramanian plot"), as suggested by Korn and Rose [7]. Plots of RMSD as a function of residue number usually fail to discriminate between "random" differences and localised differences. For instance, if a structure has undergone a domain movement around one or two hinge residues, and one superimposes the structures using only one domain, the other domain will have high RMSDs, even though the secondary and tertiary structure of the second domain as a whole may be conserved. In such a case, one would expect the DPHI/DPSI plot to be fairly flat, with some spikes at the hinge residues. Also, DPHI/DPSI plots show peptide flips between two structures as isolated spikes (the PSI angle of residue i and the PHI angle of residue i+1 will each differ by more than ~150°). On the other hand, if two NCS-related molecules have been heavily over-modelled (quite common at low resolution when no NCS constraints or restraints were used in the refinement), this will show up as a very noisy DPHI/DPSI plot. This plot facility, together with the calculation of RMSD and RMS ΔB values, makes LSQMAN a useful tool for analysing NCS-related molecules as well, and for assessing the quality of their refinement.
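The quantities behind such a plot are easy to compute once the main-chain torsion angles of both molecules are available. The sketch below assumes two equally long lists of (PHI, PSI) pairs in degrees for corresponding residues, takes differences with the 360-degree periodicity in mind, and flags candidate peptide flips using the ~150 degree criterion mentioned above. The function names are hypothetical, and this is not the plotting code of LSQMAN.

```python
def angle_difference(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def dphi_dpsi(torsions_1, torsions_2):
    """Per-residue (DPHI, DPSI) for two lists of (phi, psi) pairs."""
    if len(torsions_1) != len(torsions_2):
        raise ValueError("both molecules must supply the same residues")
    return [(angle_difference(phi1, phi2), angle_difference(psi1, psi2))
            for (phi1, psi1), (phi2, psi2) in zip(torsions_1, torsions_2)]

def peptide_flips(deltas, threshold=150.0):
    """Indices i where |DPSI(i)| and |DPHI(i+1)| both exceed `threshold`."""
    return [i for i in range(len(deltas) - 1)
            if abs(deltas[i][1]) > threshold
            and abs(deltas[i + 1][0]) > threshold]
```

Wrapping matters here: torsions of 170 and -170 degrees differ by only 20 degrees, not 340, and without the wrap such pairs would show up as spurious spikes.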

* AVAILABILITY
LSQMAN is one in a series of "O-dalisques", i.e. programs that work in conjunction with O. LSQMAN runs on SGI, ESV and DEC ALPHA/OSF1 workstations. For more information, contact GJK (E-mail: "gerard@xray.bmc.uu.se").

* REFERENCES
[1] G.J. Kleywegt & T.A. Jones, in "From First Map to Final Model" (S. Bailey, R. Hubbard & D. Waller, Eds.), SERC Daresbury Laboratory (1994), pp. 59-66.
[2] T.A. Jones, J.Y. Zou, S.W. Cowan & M. Kjeldgaard, Acta Cryst. A47 (1991), 110-119.
[3] T.A. Jones & M. Kjeldgaard, "O - the manual", Uppsala (1994).
[4] W. Kabsch, Acta Cryst. A32 (1976), 922-923.
[5] W. Kabsch, Acta Cryst. A34 (1978), 827-828.
[6] K. Diederichs, J. Appl. Cryst. 27 (1994), 436.
[7] A.P. Korn & D.R. Rose, Prot. Engin. 7 (1994), 961-967.
[8] I. Sinning, G.J. Kleywegt, S.W. Cowan, P. Reinemer, H.W. Dirr, R. Huber, G.L. Gilliland, R.N. Armstrong, X. Ji, P.G. Board, B. Olin, B. Mannervik & T.A. Jones, J. Mol. Biol. 232 (1993), 192-212.


USF Latest update at 12 February, 1998.