C687: Lecture 2

Database Searching and Sequence Alignments

January 11, 1999

Instructor: Marty Pagel


Outline of this Lecture

  1. Searching WWW databases (15 minutes)
  2. Aligning primary sequences (20 minutes)
  3. Building a Homology Model (15 minutes)
  4. 10-minute break
  5. Evaluating the Quality of a Protein Model (20 minutes)


Questions to guide you in searching biosequence databases

Consider these questions before you begin a search of biochemical sequence databases. Thinking them through will help you choose the database(s) to explore, the tools to search them with, and the parameters to set for those tools.

  1. What do you have?
    • A sequence
    • A published report
    • An evolutionary idea
    • A structural idea

  2. What will you search for?
    • Gene identification
    • Clues to gene function
    • More organisms with this gene
    • Information for constructing an evolutionary model
    • Information for constructing a structural model

  3. Where will you search?
    Consider redundancy within databases and between databases.

  4. How will you search?

  5. How will you know when you've found it?

Aligning Sequences

What is an alignment?


How are sequences aligned?

Alignments based on Amino Acid Sequences

  1. Scoring Matrices:

  2. Gap & Extension Penalties:
    Alignments are very sensitive to the penalty weights.
    An Example:
    FOUR POSSIBLE ALIGNMENTS OF TWO SEQUENCES
    in SEQ2, matches are shown in UPPERCASE letters and mismatches in lowercase letters
    in SEQ1, residues opposite a gap are shown in lowercase letters
    gaps are shown with "-" characters
    -----------------------------------------
    
    SEQ1   ATGCGggACaTG
    SEQ2   AgGCG--cC-TG   (7 matches, 1 gap of 1 bp, 1 gap of 2 bp)
    or 
    SEQ1   ATGCGGgaCaTG
    SEQ2   AgGCGc--C-TG   (7 matches, 1 gap of 1 bp, 1 gap of 2 bp) 
    or 
    SEQ1   ATGCGggaCATG
    SEQ2   AgGCG---CcTG   (7 matches, 1 gap of 3 bp)
    or 
    SEQ1   ATGCGgGaCaTG
    SEQ2   AgGCG-c-C-TG   (7 matches, 3 gaps of 1 bp)
    
    
    
    SCORES FOR THESE FOUR POSSIBLE ALIGNMENTS UNDER DIFFERENT GAP PENALTIES
    (each match scores 1; each gap costs the opening penalty plus the
    extension penalty for every gap position after the first)
    -----------------------------------------------------------------------
    Gap Opening Penalty      0     1     1     1
    Gap Extension Penalty    0     0    0.1    1
    -----------------------------------------------------------------------
    Alignment 1              7     5    4.9    4
    Alignment 2              7     5    4.9    4
    Alignment 3              7     6    5.8    4
    Alignment 4              7     4     4     4
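
The scores in the table can be reproduced with a short script. This is only a sketch, and it assumes the scoring rule that the numbers imply: 1 point per match, 0 per mismatch, and for each gap the opening penalty plus the extension penalty for every gap position after the first. The function name score_alignment is a hypothetical helper, not part of any alignment package.

```python
def score_alignment(seq1, seq2, gap_open, gap_extend):
    """Score two pre-aligned sequences of equal length.

    Assumed rule: +1 per match, 0 per mismatch, and each run of gap
    characters costs gap_open + gap_extend * (run_length - 1).
    Adjacent gap columns are treated as one run, which is sufficient here
    because all gaps in the example fall in SEQ2.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    penalty = 0.0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            # First column of a gap pays the opening penalty,
            # every later column of the same gap pays the extension penalty.
            penalty += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            if a.upper() == b.upper():  # letter case is only a display marker
                matches += 1
    return matches - penalty


# The four alignments from the example above.
ALIGNMENTS = [
    ("ATGCGggACaTG", "AgGCG--cC-TG"),
    ("ATGCGGgaCaTG", "AgGCGc--C-TG"),
    ("ATGCGggaCATG", "AgGCG---CcTG"),
    ("ATGCGgGaCaTG", "AgGCG-c-C-TG"),
]
# (gap opening penalty, gap extension penalty) for each table column.
PENALTIES = [(0, 0), (1, 0), (1, 0.1), (1, 1)]

for number, (s1, s2) in enumerate(ALIGNMENTS, start=1):
    row = [score_alignment(s1, s2, go, ge) for go, ge in PENALTIES]
    print("Alignment", number, " ".join(f"{score:g}" for score in row))
```

Which alignment ranks first depends entirely on the penalties: with no penalties all four tie, while an opening penalty of 1 with little or no extension penalty favors the single long gap of Alignment 3.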
    

  3. Sequence Alignment Methods:
    • Pair-wise alignments by Global Optimization:
      The two entire amino acid sequences are directly compared.
      Fully automatic and guaranteed to converge on one solution.

    • Multiple sequence alignments by Global Optimization:
      Fully automatic and guaranteed to converge on one solution.
      CPU & memory requirements limit this technique to the alignment of fewer than 5 sequences.

    • Iterative multiple alignments from pairwise global sub-alignments.
      Two sequences are aligned, then a third sequence is aligned to their consensus sequence, and so on.
      A fast method, but the result depends on the order in which the sequences are compared, and it can miss large regions of similarity.

    • Multiple alignments based on local similarities.
      Alignment scores of "blocks" of residues in the sequences are compared with the alignment scores obtained for (population-weighted) random sequences. If a sequence's alignment score is more than 6 standard deviations above the random-sequence score, the alignment is taken to contain structurally conserved regions. A very fast, very popular sequence alignment method.
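
The first method in the list above, pair-wise alignment by global optimization, can be sketched with the classic Needleman-Wunsch dynamic-programming recurrence. This is a minimal illustration, not the method used by any particular alignment engine. It assumes a simple match/mismatch score in place of a real scoring matrix, and a single linear gap penalty (opening and extension cost the same). The function name and default scores are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATGCGGGACATG", "AGGCGCCTG")
print(best)
print(top)
print(bottom)
```

The table is filled in time and memory proportional to the product of the two sequence lengths. Extending the same recurrence to N sequences needs an N-dimensional table, which is why the fully global approach is limited to a handful of sequences.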

  4. Alignments based on Structurally Conserved Regions:
    • Manual Superpositioning & RMSD calculations to determine SCRs
    • Automatic determination of SCRs via distance matrix of alpha-Carbon coordinates

      [Figure: Comparing segments of 2 alpha-Carbon matrices]

      [Figure: Validation of sequence alignments by off-diagonal alpha-Carbon matrix elements]
    Limitations:

    1. Matrix analyses don't include secondary structure information. SCRs should be terminated at the ends of secondary structural elements (e.g., a beta sheet should have a separate SCR for each strand).
    2. Matrix analyses may have difficulty with regularly repeating segments (e.g., repeating sequences in helices) or pseudo-symmetric dimers.
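
The automatic approach above, which compares segments of two alpha-Carbon distance matrices, can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of any specific program: the function names are hypothetical, coordinates are plain (x, y, z) tuples in angstroms, and segment similarity is measured as the RMS difference between corresponding distance-matrix elements.

```python
import math


def distance_matrix(ca_coords):
    """Return the matrix of pairwise distances between alpha-Carbon positions."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def segment_difference(dm_a, start_a, dm_b, start_b, length):
    """RMS difference between two length-by-length diagonal blocks.

    A value near zero means the two segments have the same internal
    geometry, whatever their position or orientation in space.
    """
    total = 0.0
    for i in range(length):
        for j in range(length):
            diff = dm_a[start_a + i][start_a + j] - dm_b[start_b + i][start_b + j]
            total += diff * diff
    return math.sqrt(total / (length * length))


# Toy coordinates (hypothetical): an extended segment, the same segment
# rotated and translated, and a segment with a sharp turn.
straight = [(3.8 * i, 0.0, 0.0) for i in range(5)]
moved = [(0.0, 3.8 * i + 10.0, 0.0) for i in range(5)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
        (0.0, 3.8, 0.0), (0.0, 7.6, 0.0)]

dm_straight = distance_matrix(straight)
dm_moved = distance_matrix(moved)
dm_bent = distance_matrix(bent)
print(segment_difference(dm_straight, 0, dm_moved, 0, 5))
print(segment_difference(dm_straight, 0, dm_bent, 0, 5))
```

Because internal distances do not change under rotation or translation, no superposition is needed before comparing segments. The same property explains the second limitation: repeating segments have near-identical distance blocks, so the comparison cannot tell them apart.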

  5. Evolutionary Models:
    Align sequences weighted by "distances" between host species in an evolutionary or phylogenetic tree.
    Phylogeny schemes:


Sequence Alignments in Practice

Fetching the sequences

Pearson/FASTA sequence format
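
As a brief illustration of the format: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The reader below is a minimal sketch; parse_fasta is a hypothetical helper, and the two records reuse the ungapped sequences from the gap-penalty example earlier in this lecture.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>SEQ1 first example sequence
ATGCGG
GACATG
>SEQ2 second example sequence
AGGCGCCTG
"""

for name, sequence in parse_fasta(EXAMPLE):
    print(name, len(sequence))
```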

Alignment engines

Manually Editing Alignments

Other uses of sequence alignments besides Homology Modeling


Building the Homology Model

General Methodology for Homology Protein Modeling:

  1. Determine which proteins are related to the unknown protein.

  2. Determine Structurally Conserved Regions (SCRs).

  3. Align the amino acid sequence of the unknown protein with those of the reference protein(s) within the SCRs.

  4. Assign coordinates in the SCRs of the unknown protein.

  5. Predict conformations for the rest of the peptide chain, including loops between the SCRs and N- & C-termini.
    Scan the database of alpha-carbon positions in the PDB for loops that have the same "loop base".

    [Figure: Definition of the PREFLEX and POSTFLEX "Stem" Regions]

    [Figure: Definition of the Geometry of the "Base"]

  6. Secondary Structure Prediction of large regions
    • Chou-Fasman method (and related methods):
      The propensity of a residue to adopt a secondary structure type is evaluated from the frequency with which that amino acid type is found in that secondary structure.
    • GOR II method (and related methods):
      The propensity of a residue to adopt a secondary structure type is evaluated from the frequencies with which the amino acid types in a window around it, 8 residues upstream and 8 residues downstream, are found in that secondary structure.
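
The frequency-based propensity behind the Chou-Fasman family of methods can be sketched as a ratio: the fraction of a given amino acid type found in a structure type, divided by the fraction of all residues found in that structure type. Values above 1 indicate a preference for that structure. The code below is a sketch under that assumption. The function name is hypothetical, and the sequence and structure strings are toy data invented for illustration, not real statistics.

```python
from collections import Counter


def propensities(sequence, structure, state):
    """Propensity of each amino acid type for one secondary structure state.

    sequence  : one-letter amino acid codes
    structure : per-residue structure labels, same length as sequence
    state     : the label of interest, e.g. "H" for helix
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    overall = structure.count(state) / len(structure)
    if overall == 0:
        raise ValueError("state does not occur in the structure string")
    aa_total = Counter(sequence)
    aa_in_state = Counter(a for a, s in zip(sequence, structure) if s == state)
    return {
        aa: (aa_in_state[aa] / aa_total[aa]) / overall
        for aa in aa_total
    }


# Toy data (hypothetical): H = helix, C = coil.
toy_sequence = "AAAAGGGG"
toy_structure = "HHHHHCCC"
print(propensities(toy_sequence, toy_structure, "H"))
```

In practice the counts come from a large set of proteins of known structure. A window-based method such as GOR II would additionally combine contributions from the neighboring positions instead of scoring each residue on its own.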

  7. Search for optimum side-chain rotamer conformations for new side chains.
    80% of identical residues and 70-75% of mutated residues have the same rotameric state in homologous proteins. Therefore, homology models adopt the same rotameric state as the template for all side chains where possible. For mutations to longer side chains, statistically preferred angles are chosen (e.g., for most side chains with 2 chi angles, 4 of the 9 possible gauche/anti conformers are commonly seen). Specific types of rotamers are also found with certain structural motifs. User-defined "moving" side chains can be changed automatically and iteratively until the energy no longer changes.
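
The count of 9 possible conformers for a side chain with 2 chi angles follows from 3 staggered states per chi angle. The sketch below enumerates them and classifies a measured chi angle into its nearest state. The state names and ideal angles (+60, 180, -60 degrees) are an assumed convention, the helper names are hypothetical, and which 4 conformers are common for a given residue type is deliberately not encoded.

```python
from itertools import product

# Assumed ideal angles (degrees) for the three staggered states.
STATES = {"gauche+": 60.0, "anti": 180.0, "gauche-": -60.0}


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def classify(chi):
    """Name of the staggered state closest to a chi angle in degrees."""
    return min(STATES, key=lambda name: angular_difference(chi, STATES[name]))


def enumerate_conformers(n_chi):
    """All combinations of staggered states for n_chi chi angles."""
    return list(product(STATES, repeat=n_chi))


def same_rotamer(chis_a, chis_b):
    """True if two side chains share the same state at every chi angle."""
    return [classify(c) for c in chis_a] == [classify(c) for c in chis_b]


print(len(enumerate_conformers(2)))
print(classify(-170.0), classify(65.0))
print(same_rotamer([-65.0, 175.0], [-58.0, -178.0]))
```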

  8. Use energy minimization and/or restrained molecular dynamics to refine the model. (To be discussed later in the course).


Evaluating the Quality of Protein Models

Options available in the ProStat menu of the Homology Module of InsightII are shown in bold. Options available in the InsightII Viewer Module are shown in italics.
Stereochemical Accuracy
  • Torsion angles
    Mainchain torsion angle distribution (Ramachandran plots)
    Sidechain torsion angle distributions
  • Planarity of peptide bonds and peptide bond angle distribution
  • Chirality of alpha-Carbon atoms
  • Bond lengths
  • Bond angles
  • Planarity of aromatic ring systems and sp2-hybridized end groups
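
The torsion-angle checks above all rest on one calculation: the dihedral angle defined by four atoms. For a Ramachandran plot, phi uses the atoms C(i-1), N(i), CA(i), C(i), and psi uses N(i), CA(i), C(i), N(i+1). The sketch below uses the standard atan2 formulation with plain tuples. The function names are hypothetical, and this is not the ProStat implementation.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees, in the range -180 to 180.

    0 means p0 and p3 are eclipsed (cis) about the p1-p2 bond;
    180 (or -180) means they are trans.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy geometry (hypothetical points) about a bond along the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The same function applies to side-chain chi angles and to the peptide-bond omega angle, whose deviation from 180 degrees measures peptide planarity.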
Packing Quality
  • Interatomic Distances
    "Bump" check
    "Atomic contact" quality
  • Secondary structural elements
    Location and geometry
  • Hydrophobicity
    Distribution of polar and nonpolar amino acids on surface & interior
  • Solvent accessible surface of amino acids
  • Unsatisfied buried hydrogen bond donors/acceptors
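
A "bump" check of the kind listed under Interatomic Distances can be sketched as a search for non-bonded atom pairs that sit closer than the sum of their van der Waals radii, less some tolerance. The radii below are the commonly quoted Bondi values. The tolerance of 0.4 angstroms, the function name, and the toy atoms are assumptions made for illustration, not the settings of any particular program.

```python
import math

# Bondi van der Waals radii in angstroms.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def find_bumps(atoms, bonded=frozenset(), tolerance=0.4):
    """Return (name_i, name_j, distance) for every overlapping atom pair.

    atoms     : list of (name, element, (x, y, z)) tuples
    bonded    : set of index pairs (i, j) with i < j to skip; in real use
                this should cover directly bonded atoms and atoms that
                share a bonded neighbor
    tolerance : allowed overlap in angstroms (assumed value)
    """
    bumps = []
    for i in range(len(atoms)):
        name_i, element_i, xyz_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            if (i, j) in bonded:
                continue
            name_j, element_j, xyz_j = atoms[j]
            distance = math.dist(xyz_i, xyz_j)
            limit = VDW_RADII[element_i] + VDW_RADII[element_j] - tolerance
            if distance < limit:
                bumps.append((name_i, name_j, distance))
    return bumps


# Toy atoms (hypothetical coordinates).
toy_atoms = [
    ("CA", "C", (0.0, 0.0, 0.0)),
    ("CB", "C", (1.5, 0.0, 0.0)),
    ("O", "O", (2.0, 0.0, 0.0)),
    ("N", "N", (10.0, 0.0, 0.0)),
]
for first, second, distance in find_bumps(toy_atoms, bonded={(0, 1)}):
    print(first, second, round(distance, 2))
```

The double loop is quadratic in the number of atoms, which is acceptable for a sketch; production tools typically use a spatial grid or neighbor list to limit the pairs examined.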
Folding Reliability
  • 3D comparison of model/template structure
    RMS deviations between backbone atoms of superimposed structures
  • Knowledge-based potentials
    Energy comparisons of homology models are NOT accurate enough to determine which homology model is correct.
    Incorrectly folded protein models almost always have larger surface areas than correctly folded proteins.
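
The RMS deviation between backbone atoms listed under 3D comparison above is straightforward to compute once the two structures are superimposed. The sketch below assumes the superposition has already been done and that the two coordinate lists contain the same atoms in the same order. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Both arguments are lists of (x, y, z) tuples, already superimposed,
    with atom k of coords_a corresponding to atom k of coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


# Toy backbone (hypothetical): the model is the template shifted 1 angstrom in x.
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
model = [(x + 1.0, y, z) for x, y, z in template]
print(rmsd(template, template))
print(rmsd(template, model))
```

Without a prior optimal superposition the value mixes real structural differences with differences in position and orientation, so it is only meaningful after fitting.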

Send comments to chemvis@indiana.edu
Last updated: 01/23/2001