C687 Tutorial: Sequence Alignments
This set of exercises is intended to expose you to keyword and sequence-similarity searching of WWW sequence databases,
and to provide you with an opportunity to experiment with sequence-fetching & manipulation tools.
This tutorial should require less than 90 minutes. However, please take extra time to search the WWW for computational biochemistry
resources---the time you invest NOW will be very useful during your Independent Modeling Project and your research.
All but the last section of this tutorial (conversion of FASTA-format file to HOMOLOGY-format file) can be performed on any computer
-- Mac, PC, or UNIX -- that has access to the web.
Part Zero: Prepare before Class
Review the WWW Database-Searching and Sequence Alignment Notes from lecture #2.
Review the links to other WWW sites listed in lecture #2.
Part One: BLAST searching as a "forensic" tool
Goal: Use the NCBI BLAST server (or another sequence-similarity engine if you'd prefer) to determine
- the type of protein
- the type of metabolic process in which this protein is involved
- the general type of organism in which this protein is found
for the following amino acid sequence:
>mystery1
NDPVLRAKLAKGMGHNYYGEPAWPNDLLYIFPVVILGTIACNVGLAVLEP
SMIGEPADPFATPLEILPEWYFFPVFQILRTVPNKLLGVLLMVSVPAGLL
TVPFLENVNKFQNPFRRPVATT
Hint: Because you are only interested in sequences that are closely related to the query sequence, changing the BLAST options to report only the top ten matches will probably save you time.
Save your answers in a file named sequence_tutorial.txt (e.g., use the jot editor).
Try other scoring matrices. Do you retrieve the same results? Try other parameters. Click on the parameter names to
obtain more information about the parameter. Click on the "?" icon at the top of the page to learn more about BLAST.
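Before pasting the mystery1 record into a search form, it can be worth checking that the sequence survived copying intact. The following is a minimal Python sketch, not part of the original exercise; the function name read_fasta and the choice of the 20 standard one-letter amino acid codes as the "expected" alphabet are assumptions made for illustration.

```python
def read_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# The query sequence exactly as given in Part One.
MYSTERY = """\
>mystery1
NDPVLRAKLAKGMGHNYYGEPAWPNDLLYIFPVVILGTIACNVGLAVLEP
SMIGEPADPFATPLEILPEWYFFPVFQILRTVPNKLLGVLLMVSVPAGLL
TVPFLENVNKFQNPFRRPVATT
"""

# Assumption: only the 20 standard amino acid letters are expected.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

for label, seq in read_fasta(MYSTERY):
    unexpected = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    print(label, "length:", len(seq), "unexpected letters:", unexpected)
```

If the reported length changes after you copy the sequence into a browser form, some residues were probably lost or line breaks were mangled along the way.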
Part Two: Finding protein structural information on the internet
Goal: Surf the net for a datafile containing the structure of the protein synthesis Elongation Factor Tu from a bacterium.
Notes: Which database you use to find your structure is not important here. In fact, I hope that you will be brave and search several different databases to see what you can find! Just be sure that the structural data you find are in a form that will allow you to work with the information (e.g., with a structure-viewing program or a modeling package like InsightII).
"Cut & Paste" all necessary information into your sequence_tutorial.txt file.
Part Three: Building a sequence dataset from WWW sequence databases
Goal: Use sequence-similarity and keyword searches to find other bacterial Elongation Factor Tu sequences.
- Extract the protein sequence from the structure file you located in the previous exercise and use it as the query sequence for a
BLAST search.
How you extract the sequence will vary from database to database,
but in most cases there will either be a plaintext sequence at the end of the structure file or
a link to a SwissProt or Genbank file containing the sequence embedded in the structure file.
For this BLAST search, you should set the BLAST options to give you a large number of matches, so that you can...
- compare these results to the results of a keyword search of the same database. Use either the database's own keyword search engine or the ENTREZ interface. Be sure to constrain your keyword search to bacterial sequences, or you are likely to get hundreds of matches.
"Cut & Paste" all necessary information into your sequence_tutorial.txt file.
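If the structure file you found is in PDB format, the protein sequence is usually also listed in its SEQRES records, and pulling it out by hand is error-prone. The sketch below is one possible way to do it in Python, assuming the standard fixed-column SEQRES layout (chain ID in column 12, residue names from column 20). The function name and the short demo fragment are invented for illustration and are not taken from any real Elongation Factor Tu entry.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequences(pdb_text):
    """Map chain ID -> one-letter sequence built from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue
        # Column 12 holds the chain ID; "_" stands in for a blank chain.
        chain_id = line[11:12].strip() or "_"
        # Residue names start at column 20, separated by spaces.
        residues = line[19:].split()
        # Non-standard residues are written as X rather than dropped.
        letters = [THREE_TO_ONE.get(res, "X") for res in residues]
        chains[chain_id] = chains.get(chain_id, "") + "".join(letters)
    return chains


def seqres_line(serial, chain, count, residues):
    """Build a SEQRES line in fixed columns (used only for the demo)."""
    return "SEQRES {:>3} {} {:>4}  {}".format(serial, chain, count, residues)


# Invented two-line fragment, for demonstration only.
demo = "\n".join([
    seqres_line(1, "A", 8, "MET SER LYS GLU LYS"),
    seqres_line(2, "A", 8, "PHE GLU ARG"),
])

for chain, seq in seqres_to_sequences(demo).items():
    print(">chain" + chain)
    print(seq)
```

The printed output is already in FASTA form, so it can be pasted directly into a BLAST query box.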
About 50 bacterial Elongation Factor Tu sequences have been determined. How does this compare with your BLAST results?
How does this compare with your keyword search results? Add your answers to these questions to your sequence_tutorial.txt file.
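One way to answer these questions systematically is to compare the two hit lists as sets of accession numbers. The following Python sketch is only an illustration; apart from X07898, which is the example accession number quoted in Part Four, the identifiers are placeholders and must be replaced with the ones from your own searches.

```python
def compare_hits(blast_ids, keyword_ids):
    """Split two lists of accession numbers into shared and unique hits."""
    blast, keyword = set(blast_ids), set(keyword_ids)
    return {
        "both": sorted(blast & keyword),
        "blast_only": sorted(blast - keyword),
        "keyword_only": sorted(keyword - blast),
    }


# Placeholder identifiers; substitute the accession numbers you recorded.
blast_hits = ["X07898", "PLACEHOLDER_A", "PLACEHOLDER_B"]
keyword_hits = ["X07898", "PLACEHOLDER_B", "PLACEHOLDER_C"]

for group, ids in compare_hits(blast_hits, keyword_hits).items():
    print(group, len(ids), ids)
```

Hits found by only one method are the interesting ones: they may be mis-annotated entries (missed by keywords) or divergent sequences (missed by similarity).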
Part Four: Downloading the components of your sequence dataset
Goal: Learn how to fetch the sequences that you have identified as being related to your sequence of known structure.
- Pick four of the sequences you found in your BLAST or keyword searches to align with the sequence from the
structure file in part two. In a text file in your account (use the jot editor), record the accession numbers
of these four sequences PLUS the reference sequence (the sequence from part two).
(Accession numbers are the numbers and/or letters that precede the description of the sequence -- e.g., X07898 is
the accession number for one Genbank/EMBL/DDBJ entry.)
- Using one of the following four methods, download the sequences.
- Do it the old fashioned way -- pull up each sequence file in turn and either save to file or extract the sequence information by "cut"-ing it out of Netscape and "paste"-ing it into a text file. (On the SGI machines, cut-and-paste can be accomplished by selecting the material to be copied in one window with the LEFT mouse button, putting the cursor in the window you wish to paste into, and pressing the MIDDLE mouse button.)
- Using ENTREZ, pull up all five sequences, select them, and instruct ENTREZ to save them to a file.
- Using BATCH-ENTREZ, submit your file of accession numbers to the server for retrieval. Be sure that the accession numbers in your file are arranged one to a line, with no other characters.
- Use the RETRIEVE email server to submit a list of accession numbers for retrieval. (I can show you how to do this, or you can get the help documentation by sending the server an email message containing only the word "help".)
Add the information from ONE of these methods to your sequence_tutorial.txt file.
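For the BATCH-ENTREZ method, the accession numbers must be arranged one to a line with no other characters. A small Python sketch such as the following can tidy up a list typed or pasted in any layout. The function name and the set of characters accepted in an accession number (letters, digits, underscore, period) are assumptions, and the PLACEHOLDER entries stand in for your own accession numbers.

```python
import re


def accession_list(raw_text):
    """Return accession numbers one per line, without duplicates."""
    cleaned = []
    # Accept whitespace, commas, or semicolons as separators.
    for token in re.split(r"[\s,;]+", raw_text):
        if not token:
            continue
        # Assumption: accession numbers use only these characters.
        if not re.fullmatch(r"[A-Za-z0-9_.]+", token):
            raise ValueError("unexpected characters in: " + repr(token))
        if token not in cleaned:
            cleaned.append(token)
    return "\n".join(cleaned) + "\n"


# X07898 is the example from the text; the rest are placeholders.
raw = "X07898, PLACEHOLDER1\n  PLACEHOLDER2 X07898 "
print(accession_list(raw), end="")
```

Saving the returned string to a plain text file gives a list in the layout the server expects.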
Part Five: Sequence file formats
Goal: Become familiar with the FASTA file format
- Open your sequence file in a text editor.
- Depending on what method you used to fetch your sequences, all or some or none of your sequences will be in FASTA format. Therefore, the first thing to do is to convert them all to this format. Although there are conversion utilities that will do this for you, you will be better able to troubleshoot future problems if you really understand FASTA format, so I'm asking you to do the conversions the hard way this time. Before you start hacking up your file, though, it's always a good idea to save a copy in case you mistakenly delete some sequence in the process of reformatting...
- Regardless of how you obtained your sequences, it is very likely that the descriptions which precede the sequence data will contain either more or less information than you will want in future manipulations of these data. Therefore you will need to revise the sequence labels so that
- they mean something to you -- that is, they tell you what that sequence is, and
- they are composed of ten characters or fewer.
While some sequence editing or analysis programs will accept long sequence labels, many will truncate your labels after the first ten characters, so creating short, informative labels (that lack spaces, dashes, or periods) is a good habit to get into. In fact, if you use these sequences in Insight manipulations, you'll need to trim them down to SIX characters or fewer!
- Check to make sure that the format of your completed sequence file matches the FASTA Example File.
If it doesn't, the next exercise will fail in potentially fascinating ways...
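The labeling rules above (ten characters or fewer; no spaces, dashes, or periods) are easy to check mechanically once the labels have been revised by hand. The sketch below is a hypothetical helper in Python, not a tool supplied with this course, and the example labels are invented.

```python
import re

MAX_LABEL_LENGTH = 10  # the limit suggested in this tutorial


def label_problems(labels, max_len=MAX_LABEL_LENGTH):
    """Return a list of human-readable problems found in FASTA labels."""
    problems = []
    seen = set()
    for label in labels:
        if len(label) > max_len:
            problems.append(label + ": longer than %d characters" % max_len)
        if re.search(r"[ \-.]", label):
            problems.append(label + ": contains a space, dash, or period")
        if label in seen:
            problems.append(label + ": used more than once")
        seen.add(label)
    return problems


# Invented labels for illustration.
labels = ["EcoliEFTu", "T.therm EF-Tu", "EcoliEFTu"]
for problem in label_problems(labels):
    print(problem)
```

An empty result means the labels satisfy the rules listed above; it does not check that the labels are meaningful to you.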
Part Six: CLUSTAL -- a commonly-used sequence alignment program
Goal: Use a web-based implementation of CLUSTAL to create a series of alignments of your sequences.
- CLUSTAL is the most widely-used program for building sequence alignments
- It can be run on Macintoshes, PCs, UNIX machines, and is running on several machines that you can access over the web
- The advantage of using CLUSTAL over the web is that it is easy and user-friendly (no software to install, and the local-computer versions have terrible user interfaces)
- The disadvantages to using it over the web are that it can be slow and
there are limitations on datafile size (typically no more than 20 sequences of no more than 1000 characters each)
Use the CLUSTAL site on the Alignment Tools page of Baylor College of Medicine's Search Launcher,
which also allows you to use alignment engines other than CLUSTAL.
When you prepare to build your alignment, you can select among a large number of program options by clicking on the [O]
link after the name CLUSTAL. Many of these settings have little impact on the outcome of the alignment,
but the gap penalties are of critical importance, so review the Notes about Gap Penalties from Lecture #2.
To learn more about the effect of these gap parameters:
- align your set of sequences with the gap opening penalty set to 10 and the gap extension penalty set to 1.
- do it again with gap opening at 2 and extension at 1
- do it again with gap opening at 1 and extension at 1
- do it one more time with gap opening at 1 and extension at 5
- if that hasn't exhausted your patience, I encourage you to experiment with the other settings
-- such as the amino acid classes and the scoring matrix
"Cut & Paste" all necessary information into your sequence_tutorial.txt file.
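To get a feel for why the four gap settings above produce different alignments, it can help to work out what each one charges for the same five gap positions, placed either as one long gap or as five separate single-position gaps. The Python sketch below assumes a simple affine model in which a gap of length n costs opening + extension x n. This is an illustrative simplification; CLUSTAL's actual scoring adjusts penalties by position and residue, so the numbers show the trend only.

```python
def gap_cost(length, opening, extension):
    """Cost of one gap under the simple affine model assumed here."""
    return opening + extension * length


# The four (opening, extension) settings from the exercise.
SETTINGS = [(10, 1), (2, 1), (1, 1), (1, 5)]

print("open  ext  one gap of 5  five gaps of 1")
for opening, extension in SETTINGS:
    one_long = gap_cost(5, opening, extension)
    five_short = 5 * gap_cost(1, opening, extension)
    print("%4d %4d %13d %15d" % (opening, extension, one_long, five_short))
```

Under this model, a high opening penalty makes scattered short gaps far more expensive than one long gap, whereas a high extension penalty makes the two arrangements cost nearly the same.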
Part Seven: Getting ready for homology modeling
Goal: Practise converting from FASTA format to Insight's Homology module format
How to convert from FASTA to HOMOLOGY file formats
- Download the fasta2homology.pl program.
You also need the Perl interpreter on your SGI workstation; all SGI
workstations in the Department of Chemistry have it installed.
- Type chmod 755 fasta2homology.pl to inform the UNIX operating system that this
is an executable program.
- Type one of the following:
- fasta2homology.pl
The program will ask you for the FASTA input file name and the Homology output file name.
- fasta2homology.pl FASTA_file_name
The program will ask you for the Homology output file name.
- fasta2homology.pl FASTA_file_name Homology_file_name
Notes about file names:
- Of course, you must have a FASTA-format file.
- You may choose any name for the Homology file that will be created.
However, if you choose a file name that does not end with .align,
.align will be appended to the file name;
Homology menus only recognize files that end with .align.
- If you choose an output file name and a file already exists with that name,
the program will terminate instead of overwriting the existing file.
Re-run the program and choose a different output file name,
or delete the existing file before re-running the program.
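The two file-name rules above can be summarized in a few lines of Python. This is only a sketch of the behavior as described in these notes, not the source of fasta2homology.pl; the function names are hypothetical, and whether the real program checks for an existing file before or after appending .align is an assumption.

```python
import os


def homology_output_name(name):
    """Append .align unless the name already ends with it."""
    return name if name.endswith(".align") else name + ".align"


def checked_output_name(name):
    """Return the final output name, refusing to overwrite a file."""
    final_name = homology_output_name(name)
    # Assumption: the existence check applies to the final name.
    if os.path.exists(final_name):
        raise FileExistsError(final_name + " already exists")
    return final_name


print(homology_output_name("eftu"))
print(homology_output_name("eftu.align"))
```

Predicting the final name this way tells you which file to look for in the Homology menus, and which existing file to delete or avoid before re-running the program.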
Please note that FASTA-format files may have long sequence identifiers, whereas Homology-format files have sequence identifiers of 6 characters or fewer. Identifiers are therefore truncated to their first 6 characters in the Homology-format file. If two different identifiers in your FASTA-format file are identical for the first 6 characters, they will become indistinguishable after truncation and will appear to be part of the same sequence. If this occurs, rename the conflicting identifiers in the FASTA-format file.
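Identifier clashes of this kind are easy to miss by eye, so it may be worth checking for them before running the conversion. The following Python sketch groups identifiers by their first 6 characters and reports any group with more than one member. The function name and the example identifiers are invented for illustration.

```python
from collections import defaultdict


def truncation_collisions(identifiers, width=6):
    """Map each truncated prefix to the identifiers that share it."""
    groups = defaultdict(list)
    for identifier in identifiers:
        groups[identifier[:width]].append(identifier)
    # Keep only prefixes claimed by more than one identifier.
    return {prefix: ids for prefix, ids in groups.items() if len(ids) > 1}


# Invented identifiers; the first two share their first six characters.
identifiers = ["EcoliEFTu1", "EcoliEFTu2", "TthEFTu"]

for prefix, ids in truncation_collisions(identifiers).items():
    print(prefix, "is shared by", ", ".join(ids))
```

Renaming so that the distinguishing characters come first (for example, moving a trailing number to the front) is usually enough to resolve a clash.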
Part Eight: Verify that you have completed this portion of the assignment
See the Homology Assignment page for details.
Send comments to chemvis@indiana.edu
Last updated: 01/23/2001