Question

Need Help Interpreting The Percent Identity From Blat M8/9 Output

0

Entering edit mode

11.5 years ago

biohack92 ▴ 170

I'm new to BLAT and I need help interpreting the output.

This is a portion of the m9 output from a BLAT alignment

# BLAT 35 [2009/02/26]
# Query: 488:2:6:613:1563
# Database: reference.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     1212216 1212109 7.9e-44 175.0
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     808780  808673  7.9e-44 175.0

Does the percent identity (90.74%) indicate % identity of the entire read or is this % identity of the fraction of the read?

I'm interpreting it as the %identity of the aligned portion of a read; how would you calculate the % identity of that particular read to that position?

blat alignment • 5.1k views

ADD COMMENT • link updated 11.5 years ago by Andrzej Zielezinski 11k • written 11.5 years ago by biohack92 ▴ 170

score 6 · Answer 1 · 2012-10-24

6

Entering edit mode

11.5 years ago

Andrzej Zielezinski 11k

Percent identity in BLAT results indicates a fraction of the read sequence that is aligned to a subject sequence. BLAT aims at identifying the best pair of regions, one from each sequence, query and subject, such that the alignment of these two regions is the best possible, but not necessarily across the entire two sequences.

Before you calculate the percent identity of a full-length read's sequence, you have to know four things:

Alignment length (alen)
Number of mismatches (mis)
Number of gaps (gaps)
Full length of a read sequence (len)

First three things are present in the m9 BLAT output.

Then, to calculate the percent identity use the following equation:

Pident = (alen-mis-gaps)/len*100

ADD COMMENT • link 11.5 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Awesome, thank you!

ADD REPLY • link 11.5 years ago by biohack92 ▴ 170

0

Entering edit mode

I have not used blast for a long time, so I could be wrong. But as I remember, identity=(alen-mis-gaps)/alen. This actually also makes more sense to me.

ADD REPLY • link 11.5 years ago by lh3 33k

0

Entering edit mode

Be aware that this formula assumes that all non-aligned portions of the read are mismatches, which is not necessarily true if you were actually doing the complete alignment. The FASTA package has a global-local aligner that fources one of the two sequences (the read in this case) to be entirely aligned against a part of another sequence. The PID of such a glocal alignment might be higher than the one of this formula.

ADD REPLY • link 11.5 years ago by Christian ★ 3.0k

0

Entering edit mode

I'm aligning sequencing reads to reference genomes so the equation seems to be suitable for the job.

ADD REPLY • link 11.5 years ago by biohack92 ▴ 170

score 1 · Answer 2 · 2012-10-24

You might have better luck deriving the actual % identity from .psl output. The m9 output is provided strictly for BLAST lovers - the "gap openings" are not "gapped bases" and the percent identity (google "millibad") is ridiculously affine to account for spliced alignments. Here is some awful PHP code I wrote in my youth to rebuild the alignment from the PSL:

function displayAlignment($seq,$target,$db,$qName,$strand,$qStart,$qEnd,$tName,$tStart,$tEnd,$qStarts,$tStarts,$blockCount,$blockSizes,$qSize)
{
    if(!$target)
    {
        $alignHash['alignment']='no target';
        $alignHash['verdict']='none';
        return($alignHash);
    }
    $rawTargetSeq=$target;
    if(!$seq)
    {
        $alignHash['alignment']='no sequence';
        $alignHash['verdict']='none';
        return($alignHash);
    }

    if($strand=='-')
    {
        $seq=revcomp($seq);
    }

    #query sequence
    for($seqCnt=0,$qPos=$qStarts[0],$tPos=$tStarts[0]-$tStart;$seqCnt<sizeof($qStarts);$seqCnt++)
    {
        $tStarts[$seqCnt]=$tStarts[$seqCnt]-$tStart;
        #double gap
        #&& $qStarts[$seqCnt]-$qPos==$tStarts[$seqCnt]-$tPos
        if($qStarts[$seqCnt]>$qPos && $tStarts[$seqCnt]>$tPos && true)
        {
            $doubleGap=min($qStarts[$seqCnt]-$qPos,$tStarts[$seqCnt]-$tPos);
            for($i=0;$i<$doubleGap;$i++)
            {
                $qSeq.=substr($seq,$qPos+$i,1);
                $tSeq.=substr($rawTargetSeq,$tPos+$i,1);
            }
            $qPos+=$doubleGap;
            $tPos+=$doubleGap;
        }
        if($qStarts[$seqCnt]>$qPos)
        {
            for($i=0;$i<$qStarts[$seqCnt]-$qPos;$i++)
            {
                $tSeq.='-'; #insert in query seq, next alignment block starts later
                $qSeq.=substr($seq,$qPos+$i,1);
            }
        }
        if($tStarts[$seqCnt]>$tPos)
        {
            for($i=0;$i<$tStarts[$seqCnt]-$tPos;$i++)
            {
                $qSeq.='-';
                $tSeq.=substr($rawTargetSeq,$tPos+$i,1);
            }
        }
        $qSeq.=substr($seq,$qStarts[$seqCnt],$blockSizes[$seqCnt]);
        $tSeq.=substr($rawTargetSeq,$tStarts[$seqCnt],$blockSizes[$seqCnt]);
        $qPos=$qStarts[$seqCnt]+$blockSizes[$seqCnt];
        $tPos=$tStarts[$seqCnt]+$blockSizes[$seqCnt];
    }

    if($strand=='-')
    {
        $qSeq=revcomp($qSeq);
        $tSeq=revcomp($tSeq);
    }

    $mismatches=0;
    $mismatches50=0;
    $mismatches100=0;
    $redMode=false;
    $alignLen=strlen($qSeq);#arbitrary, could also be tSeq
    for($i=0;$i<strlen($qSeq);$i++) {
        if(preg_match('/[NP]/i',substr($qSeq, $i, 1)) || preg_match('/[NP]/i',substr($tSeq, $i, 1)))
        {
            #do not count for or against
            $alignLen--;
        }
        else
        {
            if (strtoupper(substr($qSeq, $i, 1)) != strtoupper(substr($tSeq, $i, 1))) {
                $mismatches++;
                if($i<50) $mismatches50++;
                if($i<100) $mismatches100++;
                if(!$redMode)
                {
                    $fqSeq.="<SPAN class=\"red\">";
                    $redMode=true;
                }
                $fqSeq.=substr($qSeq, $i, 1);
                $ftSeq.=substr($tSeq, $i, 1);
            }
            else
            {
                if($redMode)
                {
                    $fqSeq.="</SPAN>";
                    $redMode=false;
                }
                $fqSeq.=substr($qSeq, $i, 1);
                $ftSeq.=substr($tSeq, $i, 1);
            }
        }
    }
    if($alignLen>=100)
    {
        if($mismatches50<=1 || $mismatches100 <=2 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
        $stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches100 first 100 (".(((100-$mismatches100)/100)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }
    else if($alignLen>=50)
    {
        if($mismatches50<=1 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
        $stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }
    else
    {
        if($mismatches==0) $verdict="PASS"; else $verdict.="FAIL";
        $stats.="<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }

        $alignment.=$stats;
        $alignment.="<br>query: ".$fqSeq;
        $alignment.="<br>target:".$ftSeq;
    #fclose($fa);
    if(strlen($alignment)<8000)
        $alignHash['alignment']=$alignment;
    else
        $alignHash['alignment']='too long to display';
    $alignHash['verdict']=$verdict;
    return($alignHash);
}