
BLAST Specifics


What is BLAST?

BLAST (Basic Local Alignment Search Tool) is a heuristic method for finding the highest-scoring locally optimal alignments between a query sequence and the sequences in a database.

What is gapped BLAST?

A gapped BLAST search allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely. Earlier versions of BLAST did not allow gapped alignments, but BLAST2 does.

How does BLAST work?

The BLAST algorithm and family of programs rely on work by Karlin and Altschul on the statistics of ungapped sequence alignments. These statistics allow the probability of obtaining an alignment (a Maximal Segment Pair, or MSP) with a particular score to be estimated. The BLAST algorithm permits nearly all MSPs above a cutoff to be located efficiently in a database.

The algorithm operates in three steps:

  • For a given word length w (usually 3 for proteins) and score matrix, a list is created of all words (w-mers) that can score >T when compared to w-mers from the query.
  • The database is searched using the list of w-mers to find the corresponding w-mers in the database (hits).
  • Each hit is extended to determine whether an MSP that includes the w-mer scores >S, the preset threshold score for an MSP. Since pair score matrices typically include negative values, extension of the initial w-mer hit may increase or decrease the score. Accordingly, a parameter X defines how far an extension will be tried in an attempt to raise the score above S.
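
The three steps above can be sketched in a few lines of Python. This is a toy illustration only, not the NCBI implementation: the four-letter alphabet, the +4/-1 pair score (standing in for a real matrix such as BLOSUM62), the function names, and the values chosen for T, X and S are all assumptions made for the example. Real BLAST also avoids re-extending hits that fall on the same diagonal, which this sketch does not attempt.

```python
from itertools import product

ALPHABET = "ACDE"  # toy four-letter alphabet (assumption), not the 20 amino acids


def score(a, b):
    """Toy pair score standing in for a real matrix such as BLOSUM62."""
    return 4 if a == b else -1


def word_score(u, v):
    """Sum of pair scores for two words of equal length."""
    return sum(score(a, b) for a, b in zip(u, v))


def neighborhood(query, w, T):
    """Step 1: map every w-mer scoring > T against a query w-mer to its query offsets."""
    words = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for letters in product(ALPHABET, repeat=w):
            cand = "".join(letters)
            if word_score(qword, cand) > T:
                words.setdefault(cand, []).append(i)
    return words


def find_hits(subject, words, w):
    """Step 2: scan a database sequence for listed w-mers; return (query_pos, subject_pos)."""
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], ())]


def extend_one_way(query, subject, qi, sj, step, X):
    """Extend in one direction; stop once the score drops more than X below the best seen."""
    best = cur = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        cur += score(query[qi], subject[sj])
        if cur > best:
            best = cur
        elif best - cur > X:
            break
        qi += step
        sj += step
    return best


def extend_hit(query, subject, i, j, w, X):
    """Step 3: ungapped extension of a hit in both directions; returns the segment score."""
    seed = word_score(query[i:i + w], subject[j:j + w])
    right = extend_one_way(query, subject, i + w, j + w, +1, X)
    left = extend_one_way(query, subject, i - 1, j - 1, -1, X)
    return seed + left + right


query, subject = "ACDEA", "DDACDEAEE"
words = neighborhood(query, w=3, T=7)      # with this toy score, only exact 3-mers pass
hits = find_hits(subject, words, w=3)
scores = [extend_hit(query, subject, i, j, w=3, X=5) for i, j in hits]
S = 15                                     # illustrative MSP threshold
print(hits, scores, [s > S for s in scores])
```

Lowering T from 7 to 6 in this toy setting admits every word with a single mismatch, so the word list grows from 3 entries to 30, which illustrates the trade-off between sensitivity and hit-list size.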

A low value for T reduces the chance of missing MSPs with the required score S. However, lower T values also increase the size of the hit list generated in step 2, and hence the execution time and memory required. In practice, the BLASTP program used for protein searches sets compromise values of T and X to balance processor requirements against sensitivity.

BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found. The program was developed at the NCBI and benefits from strong technical support and continuing refinement. For example, filters have recently been developed to automatically exclude regions of the query sequence that have low compositional complexity or short-periodicity internal repeats. The presence of such sequences can yield extremely large numbers of statistically significant but biologically uninteresting MSPs. For example, searching with a sequence that contains a long section of hydrophobic residues will find many proteins with transmembrane helices.

How many programs does the BLAST family have, and what are they?

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases. (Most of the time use of these is transparent, behind an interface.)

  • blastp - compares an amino acid query sequence against a protein sequence database
  • blastn - compares a nucleotide query sequence against a nucleotide sequence database
  • blastx - compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
  • tblastn - compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)
  • tblastx - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
  • PSI-BLAST - Position-Specific Iterated BLAST. This is a potentially very sensitive method for pulling out significant hits in a protein-protein database search. It first performs a gapped BLAST database search. The PSI-BLAST program then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found.

The default matrix for all protein-protein comparisons is BLOSUM62.
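
The six-frame conceptual translation used by blastx, tblastn and tblastx can be illustrated with a short Python sketch. It assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names and the frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement) are choices made for this example, and labeling conventions for the minus-strand frames differ between tools.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(dna[p:p + 3], "X")
                   for p in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(name, protein)
```

Each of the six translated strings can then be compared with protein sequences in the same way as an ordinary protein query.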

How do I interpret BLAST results?

The BLAST report consists of a number of sections. The descriptions below are for a blastp comparison, but the format for the other programs is analogous.

The BLAST report is not intended to be a parseable document. It is subject to change with little or no notice.

The BLAST report starts with some header information that lists the type of program (here blastp), the version (here 2.0.1), and a release date. Also listed are a reference to the BLAST program, the query definition line, and summary of the database used.
For example:

BLASTP 2.0.1 [Aug-20-1997]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs",
Nucleic Acids Res. 25:3389-3402.

Query= gi|129295|sp|P01013|OVAX_CHICK gene X protein - chicken (fragment)
         (232 letters)

Database: Non-redundant SwissProt sequences
           59,576 sequences; 21,219,450 total letters


One-line descriptions of the database matches found are presented next.  These
include a database sequence identifier, the corresponding definition line, as
well as the score (in bits) and the statistical significance ('E value') for this
match (please see the section on statistics for an explanation of bits and
significance).  Consider the output below, from a gapped blastp comparison of
SwissProt accession P01013 against the SwissProt database.

                                                                    High    E
Sequences producing significant alignments:                        Score  Value

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)               442  e-124
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED)               353  9e-98
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)      278  5e-75
sp|P19104|OVAL_COTJA OVALBUMIN                                        268  5e-72
sp|P48595|BOMA_HUMAN BOMAPIN (PROTEASE INHIBITOR 10)                  199  2e-51
sp|P29508|SCC1_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 1 (SCCA-1) ...   198  5e-51
sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (LEUCOCYTE...   197  1e-50
sp|P48594|SCC2_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 2 (SCCA-2) ...   196  2e-50
sp|P50453|PTI9_HUMAN CYTOPLASMIC ANTIPROTEINASE 3 (CAP3) (PROTEA...   195  6e-50
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI)               193  2e-49        
    

The first match, in this case, is the query sequence itself. The identifiers shown here are all from SwissProt, so they all have 'sp' in the first field, followed by the accession and then a locus name. The syntax of these identifiers is discussed in more detail in the appendices of ftp://ftp.cbi.pku.edu.cn/pub/databases/blast/db/blastdb.html. The definition lines are taken from the definition line in the database; an ellipsis (as for P29508) indicates that the definition line was too long for the space available.

Ungapped alignments and results from blastx and tblastn will have an additional column ('N'), displaying the number of different segment pairs used to produce the alignment, according to the Karlin-Altschul statistics.

Each alignment is preceded by the sequence identifier, the full definition line and the length of the database sequence. Next come the score (in bits as well as the raw score) and the statistical significance of the match, followed by the number of identities and positive matches according to the scoring system (e.g., BLOSUM62) and, if applicable, the number of gaps in the alignment. Finally the actual alignment is shown, with the query on top and the database match labeled as 'Sbjct'. Between the two sequences, the residue is shown if it is conserved, and a '+' is shown if there is a positive match. One or more dashes, '-', indicate insertions or deletions. The example below is the third sequence listed in the one-line descriptions above.

>sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)

          Length = 386

Score =  278 bits (744), Expect = 5e-75

Identities = 149/231 (64%), Positives = 182/231 (78%), Gaps = 2/231 (0%)

Query 2   IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
          I+++L  SS D  T +VLVNAI FKG+W+ AF  EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217

Query 62  FNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKR 121
          F VA++ +EKMKILELPFASG +SMLVLLPDEVS LE++E  INFEKLTEWT+ N ME+R
Sbjct 218 FRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEER 277

Query 122 RVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE 181
          ++KVYLP+MK+EEKYNLTSVLMA+G+TD+F  SANL+GISSAESLKISQAVH A  E++E
Sbjct 278 KIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINE 337

Query 182 DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
           G E+ GS      +  +  SE+FRADHPFLF IKH  TN +++FGR  SP
Sbjct 338 AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP 386
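
The identity and gap counts in the summary line can be recomputed from the aligned strings. The Python sketch below does this for the last block of the alignment above (query 182-232 against subject 338-386); the function names are illustrative. Positives are not computed because they require the full score matrix. The percentages printed in the report appear to be truncated rather than rounded (149/231 is about 64.5%, shown as 64%), so the sketch uses integer division; that is an inference from these figures, not a documented rule.

```python
def alignment_stats(query, sbjct):
    """Count identical columns and gapped columns in a pair of aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have the same length")
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, sbjct))
    return identities, gaps, len(query)


def percent(count, total):
    """Whole-number percentage, truncated (assumed to match the report's behavior)."""
    return count * 100 // total


# Last block of the example alignment, copied from the report above.
QUERY = "DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP"
SBJCT = "AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP"

print(alignment_stats(QUERY, SBJCT))
print(percent(149, 231), percent(182, 231), percent(2, 231))
```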

The last section lists specifics about the database searched as well as statistical and search parameters used:

Database: Non-redundant SwissProt sequences

    Posted date:  Aug 14, 1997  9:52 AM

Number of letters in database: 21,219,450
Number of sequences in database:  59,576

Lambda     K      H

   0.317    0.132    0.377

Gapped

Lambda     K      H

   0.255   0.0350    0.190


Matrix: BLOSUM62

Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)
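
The effective lengths in this listing are consistent with simple arithmetic on the other reported values, sketched below in Python. Both rules used here are inferred from the numbers in this particular report rather than stated by the source: the effective HSP length is subtracted once from the query and once per database sequence, and the negative "effective search space" matches what the true product becomes when stored in a signed 32-bit integer. The function names are illustrative.

```python
def effective_lengths(query_len, db_len, num_seqs, hsp_len):
    """Subtract the effective HSP length from the query and from each database sequence."""
    return query_len - hsp_len, db_len - num_seqs * hsp_len


def wrap_int32(value):
    """Reinterpret an integer as a signed 32-bit value (two's complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


m, n = effective_lengths(query_len=232, db_len=21219450, num_seqs=59576, hsp_len=52)
space = m * n
print(m, n, space, wrap_int32(space))
# prints: 180 18121498 3261869640 -1033097656
```

Under these assumptions the search space actually used in the statistics would be 3,261,869,640, and the negative figure is only a display artifact.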

What is the meaning of BLAST statistics and scores?

The results of a BLAST search can be judged by two numbers. One is the 'bit' score, which is defined as:

S' (bits) = [lambda * S (raw) - ln K] / ln 2

where lambda and K are Karlin-Altschul parameters. The expression of the score in terms of bits makes it independent of the scoring system used (i.e., which matrix).
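
As a worked example, the formula can be applied to the values in the sample report above. The Python sketch below uses the gapped parameters (lambda = 0.255, K = 0.0350) for the raw alignment score of 744 and for S2, and the ungapped parameters (lambda = 0.317, K = 0.132) for S1; the function name is illustrative. Because the report prints lambda and K rounded, the results agree with the printed bit scores only to within rounding.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Values taken from the example report above.
print(round(bit_score(744, 0.255, 0.0350), 1))  # alignment reported as 278 bits
print(round(bit_score(64, 0.255, 0.0350), 1))   # S2, reported as 28.4 bits
print(round(bit_score(41, 0.317, 0.132), 1))    # S1, reported as 21.7 bits
```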

The Expect value estimates the statistical significance of the match: it is the number of matches with a given score that are expected purely by chance in a search of a database of this size. An Expect value of two for a given score indicates that two matches with this score are expected purely by chance. The Expect value changes with the size of the database (in a larger database, more chance matches with a given score are expected) and is the most intuitive way to rank results or to compare the results of one query run against two different databases.

Expect (E) values

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences.
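
The exponential dependence on the score can be made concrete with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), which is not written out in the text above and is supplied here as background. In the Python sketch below, m and n are taken to be the effective query and database lengths from the sample report (180 and 18,121,498), which is an assumption about how the report's E value was derived; the function name is illustrative.

```python
import math


def expect_value(raw_score, lam, K, m, n):
    """Karlin-Altschul estimate of chance matches: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Gapped parameters and effective lengths from the example report above.
e = expect_value(raw_score=744, lam=0.255, K=0.0350, m=180, n=18121498)
print(f"E = {e:.1e}")

# Doubling the database length doubles the number of chance matches expected.
e_double = expect_value(raw_score=744, lam=0.255, K=0.0350, m=180, n=2 * 18121498)
print(f"E for a database twice as large = {e_double:.1e}")
```

The result is of the same order as the 5e-75 shown for this alignment in the report; an exact match is not expected because lambda and K are printed in rounded form.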

The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

How do I search my own sequences with BLAST?

Several BLAST services are available at CBI, including:

  • BLASTn - Nucleotide-nucleotide BLAST
  • BLASTp - Protein-protein BLAST
  • BLASTx - Translated Query vs. Protein Database