C687 Tutorial: Sequence Alignments


This set of exercises is intended to expose you to keyword and sequence-similarity searching of WWW sequence databases, and to provide you with an opportunity to experiment with sequence-fetching and manipulation tools. This tutorial should require less than 90 minutes. However, please take extra time to search the WWW for computational biochemistry resources; the time you invest NOW will be very useful during your Independent Modeling Project and your research.

All but the last section of this tutorial (conversion of a FASTA-format file to a HOMOLOGY-format file) can be performed on any computer -- Mac, PC, or UNIX -- that has access to the web.


Part Zero: Prepare before Class

Review the WWW Database-Searching and Sequence Alignment Notes from lecture #2. Review the links to other WWW sites listed in lecture #2.


Part One: BLAST searching as a "forensic" tool

Goal: Use the NCBI BLAST server (or another sequence-similarity engine if you'd prefer) to determine for the following amino acid sequence:

>mystery1
NDPVLRAKLAKGMGHNYYGEPAWPNDLLYIFPVVILGTIACNVGLAVLEP
SMIGEPADPFATPLEILPEWYFFPVFQILRTVPNKLLGVLLMVSVPAGLL
TVPFLENVNKFQNPFRRPVATT

Hint: Because you are only interested in sequences that are closely related to the query sequence, changing the BLAST options to report only the top ten matches will probably save you time...

Save your answers in a file named sequence_tutorial.txt (e.g., use the jot editor).

Try other scoring matrices. Do you retrieve the same results? Try other parameters. Click on the parameter names to obtain more information about the parameter. Click on the "?" icon at the top of the page to learn more about BLAST.


Part Two: Finding protein structural information on the internet

Goal: Surf the net for a datafile containing the structure of the protein synthesis Elongation Factor Tu from a bacterium.

Notes: Which database you use to find your structure is not important here. In fact, I hope that you will be brave and search several different databases to see what you can find! Just be sure that the structural data you find are in a form that will allow you to work with the information (e.g., with a structure-viewing program or a modeling package like InsightII).

"Cut & Paste" all necessary information into your sequence_tutorial.txt file.


Part Three: Building a sequence dataset from WWW sequence databases

Goal: Use sequence-similarity and keyword searches to find other bacterial Elongation Factor Tu sequences.
  1. Extract the protein sequence from the structure file you located in the previous exercise and use it as the query sequence for a BLAST search. How you extract the sequence will vary from database to database, but in most cases there will either be a plaintext sequence at the end of the structure file or a link to a SwissProt or Genbank file containing the sequence embedded in the structure file. For this BLAST search, you should set the BLAST options to give you a large number of matches, so that you can...
  2. compare these results to the results of a keyword search of the same database. Use either the keyword search engine, or use the ENTREZ interface. Be sure to constrain your keyword search to bacterial sequences, or you are likely to get hundreds of matches.

"Cut & Paste" all necessary information into your sequence_tutorial.txt file.

About 50 bacterial Elongation Factor Tu sequences have been determined. How does this compare with your BLAST results? How does this compare with your keyword search results? Add your answers to these questions to your sequence_tutorial.txt file.
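
One way to make this comparison concrete is to treat each search result as a set of accession numbers and look at the overlap. The Python sketch below is only an illustration: the function name compare_hits is hypothetical, and the accession values are made-up placeholders, not real search results.

```python
def compare_hits(blast_hits, keyword_hits):
    """Summarize the overlap between two lists of accession numbers."""
    blast = set(blast_hits)
    keyword = set(keyword_hits)
    return {
        "both": sorted(blast & keyword),
        "blast_only": sorted(blast - keyword),
        "keyword_only": sorted(keyword - blast),
    }

# Placeholder accessions for illustration only; paste in your own results.
blast_hits = ["ACC001", "ACC002", "ACC003"]
keyword_hits = ["ACC002", "ACC003", "ACC004"]

summary = compare_hits(blast_hits, keyword_hits)
for category, accessions in summary.items():
    print(category, len(accessions), accessions)
```

Sequences that appear in only one of the two lists are the interesting ones: they show what each search strategy misses.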


Part Four: Downloading the components of your sequence dataset

Goal: Learn how to fetch the sequences that you have identified as being related to your sequence of known structure.
  1. Pick four of the sequences you found in your BLAST or keyword searches to align with the sequence from the structure file in part two. In a text file in your account (use the jot editor), record the accession numbers of these four sequences PLUS the reference sequence (the sequence from part two). (Accession numbers are the numbers and/or letters that precede the description of the sequence -- e.g., X07898 is the accession number for one Genbank/EMBL/DDBJ entry.)
  2. Using one of the following four methods, download the sequences.

Add the information from ONE of these methods to your sequence_tutorial.txt file.
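
If you would rather script the download, one possible approach (an assumption on my part, and not necessarily one of the methods this tutorial has in mind) is the NCBI E-utilities efetch service, which returns FASTA text for a list of accession numbers. The sketch below is Python; the helper names efetch_fasta_url and fetch_fasta are hypothetical, and the choice of database ("protein" or "nuccore") depends on what kind of record each accession number refers to.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accessions, db="protein"):
    """Build an efetch URL that requests the records as FASTA text."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)

def fetch_fasta(accessions, db="protein"):
    """Download the records as one FASTA string (needs network access)."""
    with urlopen(efetch_fasta_url(accessions, db)) as response:
        return response.read().decode("utf-8")

# X07898 is the example Genbank/EMBL/DDBJ accession mentioned above,
# so it is looked up in the nucleotide database here.
url = efetch_fasta_url(["X07898"], db="nuccore")
print(url)
```

Calling fetch_fasta with your five accession numbers would return all of the records in a single string that you can save to a file.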


Part Five: Sequence file formats

Goal: Become familiar with the FASTA file format
  1. Open your sequence file in a text editor.
  2. Depending on what method you used to fetch your sequences, all or some or none of your sequences will be in FASTA format. Therefore, the first thing to do is to convert them all to this format. Although there are conversion utilities that will do this for you, you will be better able to troubleshoot future problems if you really understand FASTA format, so I'm asking you to do the conversions the hard way this time. Before you start hacking up your file, though, it's always a good idea to save a copy in case you mistakenly delete some sequence in the process of reformatting...
  3. Regardless of how you obtained your sequences, it is very likely that the descriptions which precede the sequence data will contain either more or less information than you will want in future manipulations of these data. Therefore you will need to revise the sequence labels. While some sequence editing or analysis programs will accept long sequence labels, many will truncate your labels after the first ten characters, so creating short, informative labels (that lack spaces, dashes, or periods) is a good habit to get into. In fact, if you use these sequences in Insight manipulations, you'll need to trim them down to SIX characters or fewer!
  4. Check to make sure that the format of your completed sequence file matches the FASTA Example File. If it doesn't, the next exercise will fail in potentially fascinating ways...
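
After you have done the conversion by hand, a short script can serve as a sanity check. The Python sketch below assumes the usual FASTA layout (a line starting with ">" that carries the label, followed by one or more lines of sequence) and applies the label rules from step 3. The function names parse_fasta and short_label are hypothetical, and the description after "mystery1" in the example is invented for illustration.

```python
import re

def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-format text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records

def short_label(label, max_len=6):
    """Drop whitespace, dashes, and periods, then trim to max_len characters."""
    return re.sub(r"[\s\-.]", "", label)[:max_len]

# The description after "mystery1" is invented for this example.
example = """>mystery1 unknown protein fragment
NDPVLRAKLAKGMGHNYYGEPAWPNDLLYIFPVVILGTIACNVGLAVLEP
SMIGEPADPFATPLEILPEWYFFPVFQILRTVPNKLLGVLLMVSVPAGLL
TVPFLENVNKFQNPFRRPVATT
"""

for label, seq in parse_fasta(example):
    print(">" + short_label(label))
    print(seq)
```

Blind truncation rarely gives an informative label, so treat short_label as a check on length and forbidden characters rather than as a substitute for choosing the labels yourself.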

Part Six: CLUSTAL -- a commonly-used sequence alignment program

Goal: Use a web-based implementation of CLUSTAL to create a series of alignments of your sequences. Use the CLUSTAL site at the Alignment Tools of Baylor College of Medicine's Search Launcher, which also allows you to use alignment engines other than CLUSTAL.

When you prepare to build your alignment, you can select among a large number of program options by clicking on the [O] link after the name CLUSTAL. Although many of these settings have little impact on the outcome of the alignment, the gap penalties are of critical importance; review the Notes about Gap Penalties from Lecture #2. To learn more about the effect of these gap parameters:

  1. align your set of sequences with the gap opening penalty set to 10 and the gap extension penalty set to 1.
  2. do it again with gap opening at 2 and extension at 1
  3. do it again with gap opening at 1 and extension at 1
  4. do it one more time with gap opening at 1 and extension at 5
  5. if that hasn't exhausted your patience, I encourage you to experiment with the other settings -- such as the amino acid classes and the scoring matrix
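
To see why these settings pull the alignment in different directions, it can help to work out what gaps cost under each one. The Python sketch below assumes the simplest affine form, cost = opening + extension x length. CLUSTAL's own scoring applies further adjustments, so treat the numbers as illustrative only; the function name gap_cost is hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost under the assumed form: open + extend * length."""
    return gap_open + gap_extend * length

# (gap opening, gap extension) pairs from the four alignment runs above.
settings = [(10, 1), (2, 1), (1, 1), (1, 5)]

for gap_open, gap_extend in settings:
    one_long = gap_cost(6, gap_open, gap_extend)         # one gap of 6 residues
    three_short = 3 * gap_cost(2, gap_open, gap_extend)  # three gaps of 2 residues
    print(gap_open, gap_extend, one_long, three_short)
```

Under this assumption, with opening 10 and extension 1, three short gaps cost more than twice as much as one long gap (36 versus 16), so the alignment should favor a few long gaps. With opening 1 and extension 5 the two layouts cost nearly the same (33 versus 31), and every gapped position is expensive.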

"Cut & Paste" all necessary information into your sequence_tutorial.txt file.


Part Seven: Getting ready for homology modeling

Goal: Practise converting from FASTA format to Insight's Homology module format

How to convert from FASTA to HOMOLOGY file formats

  1. Download the fasta2homology.pl program. You also need the Perl interpreter on your SGI workstation; all SGI workstations in the Department of Chemistry have it installed.
  2. Type chmod 755 fasta2homology.pl to inform the UNIX operating system that this is an executable program.
  3. Type one of the following:
Notes about file names: Please note that FASTA-format files have sequence identifiers that may have many characters. Homology-format files have sequence identifiers with 6 characters or fewer. Therefore, if two different identifiers in your FASTA-format file are identical for the first 6 characters, both will be truncated to those 6 characters in the Homology-format file, and they will appear to be part of the same sequence. If this occurs, rename the conflicting identifiers in the FASTA-format file.
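
You can check for this problem before running the conversion. The Python sketch below mirrors only the 6-character truncation rule described in the note; it does not reproduce what fasta2homology.pl does internally. The function name find_truncation_conflicts and the identifiers are hypothetical.

```python
from collections import defaultdict

def find_truncation_conflicts(identifiers, max_len=6):
    """Group identifiers that become identical after truncation."""
    groups = defaultdict(list)
    for identifier in identifiers:
        groups[identifier[:max_len]].append(identifier)
    # Keep only the truncated forms shared by more than one identifier.
    return {short: names for short, names in groups.items() if len(names) > 1}

# Hypothetical identifiers for illustration only.
identifiers = ["Ecoli_EFTu", "Ecoli_EFG", "Tth_EFTu"]

conflicts = find_truncation_conflicts(identifiers)
for short, names in conflicts.items():
    print(short, "<-", ", ".join(names))
```

Here the first two identifiers both truncate to "Ecoli_", so one of them would need to be renamed in the FASTA-format file.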


Part Eight: Verify that you have completed this portion of the assignment

See the Homology Assignment page for details.


Send comments to chemvis@indiana.edu
Last updated: 01/23/2001