Fasta Specifics
- How does Fasta work?
- How do I interpret fasta results?
- What are Scoring Matrices?
- What is Dayhoff Mutation Data Matrix?
- What is BLOSUM?
- How do I search my own sequences with FASTA?
- Why does FASTA sometimes not display any one-line descriptions or alignments?
- I have a 12 residue peptide. I want to translate it to DNA and then search DNA databases?
How does Fasta work?
The early personal computers had insufficient memory and were too slow to carry out a database scan using dynamic programming. Accordingly, Wilbur and Lipman developed a fast procedure for DNA scans that in concept searches for the most significant diagonals in a dot-plot.
The initial step in the algorithm is to identify all exact matches of length k (k-tuples) or greater between the two sequences.
Speed is achieved by employing a lookup procedure. For example, for proteins, if k = 3 then there are 8,000 (20^3) possible k-tuples and each element of an array C of length 8,000 is set to represent one of these k-tuples.
Sequence A is scanned once and the location of each k-tuple in A is recorded in the corresponding element of C.
Sequence B is then scanned and by reference to C the location of all k-tuple matches common to A and B may be identified.
If two k-tuples are present on the same diagonal then the difference between their starting position (offset) is also the same, thus the diagonals with the most significant number of matches may be identified.
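The lookup and diagonal steps above can be sketched in a few lines of Python. This is a minimal illustration, not the FASTA source: a dictionary stands in for the array C, the offset of a match is taken as (position in A) minus (position in B), and the function names are made up for this example.

```python
from collections import defaultdict

def ktuple_diagonals(a, b, k=2):
    """Count k-tuple matches between a and b on each diagonal (offset = i - j)."""
    # Lookup table (plays the role of array C): k-tuple -> start positions in A.
    table = defaultdict(list)
    for i in range(len(a) - k + 1):
        table[a[i:i + k]].append(i)

    # Scan B once; every k-tuple found in the table is a match with A.
    diagonals = defaultdict(int)
    for j in range(len(b) - k + 1):
        for i in table.get(b[j:j + k], ()):
            diagonals[i - j] += 1
    return dict(diagonals)

def best_diagonals(a, b, k=2, top=5):
    """Return the `top` diagonals with the most k-tuple matches."""
    counts = ktuple_diagonals(a, b, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]
```

For example, best_diagonals("ACDEFG", "XXACDEFG") reports a single diagonal at offset -2, because every shared k-tuple starts two positions later in the second sequence.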
Since runs of identity are relatively rare even between related proteins, Lipman and Pearson first identified the five diagonals of highest similarity with k set to 1 or 2. They then applied Dayhoff's scoring scheme to the amino acid pairs over these regions. The region giving the highest score for the protein comparison was used to rank the sequences located in the databank for further study by more rigorous procedures. Pearson and Lipman have refined these ideas in the program FASTA. FASTA saves the 10 highest-scoring regions of identity, which are then re-scored with the PAM250 matrix.
If several initial regions score above a preset cutoff, those that could form a longer alignment are joined, allowing for gaps, and a score initn is calculated by subtracting a penalty for each gap. initn is used to rank the database sequences by similarity.
Finally, dynamic programming is used over a narrow region of the high scoring diagonal to produce an alignment with score opt.
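The idea of restricting dynamic programming to a narrow band around one diagonal can be sketched as follows. This is only an illustration under stated assumptions: the match, mismatch and gap values are arbitrary stand-ins for a real scoring matrix and gap penalties, the band width is a free parameter, and the function name is invented here. It returns a local alignment score computed only from cells close to the chosen diagonal.

```python
def banded_local_score(a, b, diagonal, width=3, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diagonal| <= width."""
    prev = {}   # scores of the previous row, keyed by column j
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - diagonal - width)
        hi = min(len(b), i - diagonal + width)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band default to 0, so a gap move from them
            # can never beat starting a fresh local alignment.
            score = max(
                0,
                prev.get(j - 1, 0) + sub,   # extend along the diagonal
                prev.get(j, 0) + gap,       # gap in b
                cur.get(j - 1, 0) + gap,    # gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best
```

Because only about 2 * width + 1 cells are filled per row, the cost grows with the sequence length times the band width rather than with the product of both lengths.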
FASTA shows only the top-scoring region; it does not locate all high-scoring alignments between two sequences. As a consequence, FASTA may not directly identify repeats or multiple domains that are shared between two proteins.
How do I interpret fasta results?
The first part of the output file contains a histogram showing the number of overlapping regions between the query and search set sequences that were observed for each score. The histogram is divided into bins of size 2 (for proteins) or 4 (for nucleic acids). For a nucleic acid query sequence, the histogram would normally show the frequency of overlapping regions with scores of 1 to 4, 5 to 8, 9 to 12, and so forth.
The top score for each bin is listed in the leftmost column of the histogram. The second and third columns list the number of init1 and initn scores that fall within each bin. (See the ALGORITHM topic for an explanation of init1 and initn scores.) In the histogram itself, each symbol represents two sequences. The init1 and initn scores are represented by minus (-) and plus (+) symbols, respectively. If the init1 and initn scores are the same in a bin, or if both scores exceed the limit that the histogram can display (100 scores), they are both represented by equals (=) symbols. The mean scores for the entire search are displayed at the bottom of the histogram, along with their standard deviations in parentheses.
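The binning described above can be sketched as follows. This is a small illustration only: the function name is invented, and it assumes positive integer scores, with each bin labelled by its top score as in the leftmost column of the histogram.

```python
from collections import Counter

def bin_scores(scores, bin_size):
    """Count scores per bin, labelling each bin by its top score."""
    counts = Counter()
    for s in scores:
        top = -(-s // bin_size) * bin_size   # round up to a multiple of bin_size
        counts[top] += 1
    return dict(sorted(counts.items()))
```

With a bin size of 4, scores 1 to 4 fall in the bin labelled 4, scores 5 to 8 in the bin labelled 8, and so on.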
Below the histogram, FastA displays a listing of the best scores. /rev after the sequence name in this list indicates that the search set sequence overlaps with the bottom (reverse-complement) strand of the query sequence.
Following the list of best scores, FastA displays the alignments of the regions of best overlap between the query and search sequences. A /rev following the query sequence name indicates that the search sequence is aligned with the bottom strand of the query sequence.
This program displays only the region of overlap between the two aligned sequences unless you put -SHOWall on the command line. The display of identities and conservative replacements between the aligned sequences depends on the value of the -MARKx command-line option. By default (-MARKx=3), the pipe character (|) is used to denote identities and the colon (:) to denote conservative replacements.
What are Scoring Matrices?
These are tables used, particularly in protein/protein comparisons, to give computer programs some biological knowledge of amino acids with similar properties, rather than just treating all substitutions as equivalent.
All algorithms to compare protein sequences rely on some scheme to score the equivalencing of each of the 210 possible pairs of amino acids (i.e. 190 pairs of different amino acids + 20 pairs of identical amino acids).
Most scoring schemes represent the 210 pairs of scores as a 20 x 20 matrix of similarities where identical amino acids and those of similar character (e.g. I, L) give higher scores compared to those of different character (e.g. I, D). Since the first protein sequences were obtained, many different types of scoring scheme have been devised.
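The count of 210 pairs, and the way a symmetric 20 x 20 matrix is consulted, can be checked with a short sketch. The names below are invented for this example, and the only scoring scheme shown is the plain identity scheme (1 for identical residues, 0 otherwise), so no real substitution values are assumed.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every unordered pair, including each amino acid paired with itself.
PAIRS = list(combinations_with_replacement(AMINO_ACIDS, 2))

def identity_matrix():
    """Simplest scheme: 1 for identical residues, 0 for any substitution."""
    return {pair: (1 if pair[0] == pair[1] else 0) for pair in PAIRS}

def pair_score(matrix, x, y):
    """Look up a pair regardless of order (the matrix is symmetric)."""
    return matrix[tuple(sorted((x, y)))]

def score_ungapped(matrix, seq_a, seq_b):
    """Sum the pair scores over two equal-length, ungapped sequences."""
    return sum(pair_score(matrix, x, y) for x, y in zip(seq_a, seq_b))
```

A substitution-based matrix would be used in exactly the same way; only the values stored for each of the 210 pairs would differ, for instance giving the I/L pair a higher score than the I/D pair.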
The general consensus is that matrices derived from observed substitution data (e.g. the Dayhoff or BLOSUM matrices) are superior to identity, genetic code or physical property matrices. However it seems likely that as more protein three dimensional structures are determined, substitution tables derived from structure comparison will give the most reliable data.
Higher-numbered PAM matrices and lower-numbered BLOSUM matrices will tend to find longer, weaker local alignments, whereas lower-numbered PAM matrices and higher-numbered BLOSUM matrices will tend to find short alignments of highly similar sequences. It is thus a good idea to do sequence comparisons with a variety of scoring matrices. Reasonable defaults are PAM250 or BLOSUM62.
What is Dayhoff Mutation Data Matrix?
Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers. The system arose out of a general model for the evolution of proteins. Dayhoff and co-workers examined alignments of closely similar sequences where the likelihood of a particular mutation (e.g. A-D) being the result of a set of successive mutations (e.g. A-x-y-D) was low. Since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1. A complete picture of the mutation process, including those amino acids which did not change, was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database. This was combined with the point mutation data to give the mutation probability matrix (M), where each element Mij gives the probability of the amino acid in column j mutating to the amino acid in row i after a particular evolutionary time, for example after 2 PAM (Percentage of Acceptable point Mutations per 10^8 years).
The mutation probability matrix is specific for a particular evolutionary distance, but may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself. At the level of 2,000 PAM Schwartz and Dayhoff suggest that all the information present in the matrix has degenerated except that the matrix element for Cys-Cys is 10% higher than would be expected by chance. At the evolutionary distance of 256 PAMs one amino acid in five remains unchanged but the amino acids vary in their mutability; 48% of the tryptophans, 41% of the cysteines and 20% of the histidines would be unchanged, but only 7% of serines would remain.
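Extrapolating to greater evolutionary distances by repeated multiplication can be illustrated with a toy matrix. The sketch below assumes a hypothetical two-letter alphabet with invented probabilities (a real matrix is 20 x 20 and comes from observed data); column j holds the probabilities of residue j becoming each residue i, so every column sums to 1.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(m, steps):
    """Multiply m by itself `steps` times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = matmul(result, m)
    return result

# Invented values for illustration only.
TOY_M = [[0.99, 0.02],
         [0.01, 0.98]]
```

Comparing matrix_power(TOY_M, 250) with matrix_power(TOY_M, 2000) shows the behaviour described above: at very large distances the columns become almost identical, so the matrix no longer carries information about which residue was originally present.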
When used for the comparison of protein sequences, the mutation probability matrix is usually normalised by dividing each element Mij by the relative frequency of exposure to mutation of the amino acid i. This operation results in the symmetrical relatedness odds matrix with each element giving the probability of amino acid replacement per occurrence of i per occurrence of j. The logarithm of each element is taken to allow probabilities to be summed over a series of amino acids rather than requiring multiplication. The resulting matrix is the log-odds matrix which is frequently referred to as Dayhoff's matrix and often used at a distance of close to 256 PAM since this lies near to the limit of detection of distant relationships where approximately 80% of the amino acid positions are observed to have changed.
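The normalisation and logarithm steps can be sketched on a toy example. Everything below is illustrative: the two-letter alphabet, the probabilities and the frequencies are invented, the function names are made up, and the factor of 10 with base-10 logarithms is assumed here as the scaling convention.

```python
from math import log10

# Invented two-letter example; column j -> row i probabilities.
TOY_M = [[0.99, 0.02],
         [0.01, 0.98]]
TOY_FREQS = [2.0 / 3.0, 1.0 / 3.0]   # relative frequency of each residue
INDEX = {"a": 0, "b": 1}

def relatedness_odds(m, freqs):
    """Divide each element Mij by the frequency of residue i."""
    n = len(m)
    return [[m[i][j] / freqs[i] for j in range(n)] for i in range(n)]

def log_odds(m, freqs, scale=10):
    """Take logarithms so that scores can be added along an alignment."""
    return [[scale * log10(r) for r in row] for row in relatedness_odds(m, freqs)]

def alignment_score(scores, seq_a, seq_b):
    """Sum log-odds scores over two equal-length, ungapped sequences."""
    return sum(scores[INDEX[x]][INDEX[y]] for x, y in zip(seq_a, seq_b))
```

In this toy case the odds matrix comes out symmetric, identical pairs receive positive scores and the substitution receives a negative score; adding the log-odds over an alignment is equivalent to multiplying the underlying odds.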
What is BLOSUM?
BLOSUM is an alternative family of matrices, developed by Henikoff and Henikoff from ungapped local multiple alignments of more distantly related sequences.
First a database of multiple alignments without gaps for short regions of related sequences was derived.
Within each alignment in the database, the sequences were clustered into groups where the sequences are similar at some threshold value of percentage identity.
Substitution frequencies for all pairs of amino acids were then calculated between the groups, and these were used to calculate a log-odds BLOSUM (blocks substitution matrix).
Different matrices are obtained by varying the clustering threshold. For example, the BLOSUM 80 matrix was derived using a threshold of 80% identity.
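The counting and log-odds steps can be sketched for a single ungapped block. This is a simplified illustration, not the published procedure in full: the clustering step is omitted (each sequence is treated as its own group), the block passed in is assumed to be a list of equal-length strings, the function names are invented, and scores are expressed as 2 * log2 of observed over expected pair frequency, rounded to integers.

```python
from collections import Counter
from itertools import combinations
from math import log2

def pair_counts(block):
    """Count unordered residue pairs within every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts

def block_log_odds(block):
    """Log-odds score for every residue pair observed in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}   # observed pair frequencies

    # Frequency of each residue, as implied by the observed pairs.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] * (1 if x == y else 2)
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores
```

Pairs that never occur in the block get no score in this sketch; a real matrix is built from a large database of blocks so that all pairs are observed.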
How do I search my own sequences with FASTA?
You can go to the EBI FASTA Service.
Why does FASTA sometimes not display any one-line descriptions or alignments?
Fasta has recently changed the way it scores the significance of the matches it finds. It now works out the number of times any particular score value of a match would be seen by chance if a random sequence had been used and will not report any match that is expected more than 10 times (for a DNA match) or 2 times (for a protein match).
If every match in your search was expected by chance more often than this threshold, none of them is reported, so the output contains no one-line descriptions or alignments.
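The effect of the threshold can be shown with a small sketch. The function name and the (name, expectation) tuples are hypothetical; only the limits of 10 for DNA and 2 for protein come from the description above.

```python
def reportable_matches(matches, is_protein, dna_limit=10.0, protein_limit=2.0):
    """Keep only matches expected by chance no more often than the limit."""
    limit = protein_limit if is_protein else dna_limit
    return [(name, expect) for name, expect in matches if expect <= limit]
```

If every match has an expectation above the limit, the returned list is empty, which corresponds to output with no descriptions or alignments.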
I have a 12 residue peptide. I want to translate it to DNA and then search DNA databases?
"I have a short, human, amino acid sequence 12 residues. I want to find out all the possible DNA sequences that might encode these amino acids. How do I do this I'm used to GCG, but not Staden etc. Once I have the set of possible DNA sequences, I want to search the DNA, including EST, database for possible homologues."
Use the Fasta option of GCG. This Fasta option automatically selects the appropriate Fasta program, so if your sequence is a protein sequence and you choose a nucleic acid database like 'emblminus' or 'est' then it will use the TFASTA program which translates the nucleic acid database entries in all six frames.
You might like to set the threshold for reporting matches to Expect=20 instead of the default of Expect=10. This will find fainter similarities.
For an even more sensitive search, set the TFASTX option to be true. This uses the TFASTX program instead of the TFASTA program. TFASTX deals with frameshifts better than TFASTA. If you choose this option you should be prepared to wait some time for the results.
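The six-frame translation mentioned above, and the reason for searching this way rather than enumerating every possible DNA encoding of the peptide, can be sketched as follows. This is an illustration only, not how TFASTA is implemented: it assumes the standard genetic code, upper-case input containing only A, C, G and T, and function names invented for this example.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of the sequence."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[f:]) for strand in (dna, reverse) for f in range(3)]

def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    codons_per_residue = Counter(CODON_TABLE.values())
    return prod(codons_per_residue[aa] for aa in peptide)
```

Since most amino acids have several codons, count_encodings grows multiplicatively with peptide length, so translating the database entries and comparing at the protein level avoids generating a very large set of DNA queries.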
Copyright © 1996-2008,