You might have better luck deriving the actual % identity from .psl output. The m9 output is provided strictly for BLAST lovers - the "gap openings" are not "gapped bases" and the percent identity (google "millibad") is ridiculously affine to account for spliced alignments. Here is some awful PHP code I wrote in my youth to rebuild the alignment from the PSL:
function displayAlignment($seq,$target,$db,$qName,$strand,$qStart,$qEnd,$tName,$tStart,$tEnd,$qStarts,$tStarts,$blockCount,$blockSizes,$qSize)
{
if(!$target)
{
$alignHash['alignment']='no target';
$alignHash['verdict']='none';
return($alignHash);
}
$rawTargetSeq=$target;
if(!$seq)
{
$alignHash['alignment']='no sequence';
$alignHash['verdict']='none';
return($alignHash);
}
if($strand=='-')
{
$seq=revcomp($seq);
}
#query sequence
for($seqCnt=0,$qPos=$qStarts[0],$tPos=$tStarts[0]-$tStart;$seqCnt<sizeof($qStarts);$seqCnt++)
{
$tStarts[$seqCnt]=$tStarts[$seqCnt]-$tStart;
#double gap
#&& $qStarts[$seqCnt]-$qPos==$tStarts[$seqCnt]-$tPos
if($qStarts[$seqCnt]>$qPos && $tStarts[$seqCnt]>$tPos && true)
{
$doubleGap=min($qStarts[$seqCnt]-$qPos,$tStarts[$seqCnt]-$tPos);
for($i=0;$i<$doubleGap;$i++)
{
$qSeq.=substr($seq,$qPos+$i,1);
$tSeq.=substr($rawTargetSeq,$tPos+$i,1);
}
$qPos+=$doubleGap;
$tPos+=$doubleGap;
}
if($qStarts[$seqCnt]>$qPos)
{
for($i=0;$i<$qStarts[$seqCnt]-$qPos;$i++)
{
$tSeq.='-'; #insert in query seq, next alignment block starts later
$qSeq.=substr($seq,$qPos+$i,1);
}
}
if($tStarts[$seqCnt]>$tPos)
{
for($i=0;$i<$tStarts[$seqCnt]-$tPos;$i++)
{
$qSeq.='-';
$tSeq.=substr($rawTargetSeq,$tPos+$i,1);
}
}
$qSeq.=substr($seq,$qStarts[$seqCnt],$blockSizes[$seqCnt]);
$tSeq.=substr($rawTargetSeq,$tStarts[$seqCnt],$blockSizes[$seqCnt]);
$qPos=$qStarts[$seqCnt]+$blockSizes[$seqCnt];
$tPos=$tStarts[$seqCnt]+$blockSizes[$seqCnt];
}
if($strand=='-')
{
$qSeq=revcomp($qSeq);
$tSeq=revcomp($tSeq);
}
$mismatches=0;
$mismatches50=0;
$mismatches100=0;
$redMode=false;
$alignLen=strlen($qSeq);#arbitrary, could also be tSeq
for($i=0;$i<strlen($qSeq);$i++) {
if(preg_match('/[NP]/i',substr($qSeq, $i, 1)) || preg_match('/[NP]/i',substr($tSeq, $i, 1)))
{
#do not count for or against
$alignLen--;
}
else
{
if (strtoupper(substr($qSeq, $i, 1)) != strtoupper(substr($tSeq, $i, 1))) {
$mismatches++;
if($i<50) $mismatches50++;
if($i<100) $mismatches100++;
if(!$redMode)
{
$fqSeq.="<SPAN class=\"red\">";
$redMode=true;
}
$fqSeq.=substr($qSeq, $i, 1);
$ftSeq.=substr($tSeq, $i, 1);
}
else
{
if($redMode)
{
$fqSeq.="</SPAN>";
$redMode=false;
}
$fqSeq.=substr($qSeq, $i, 1);
$ftSeq.=substr($tSeq, $i, 1);
}
}
}
if($alignLen>=100)
{
if($mismatches50<=1 || $mismatches100 <=2 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
$stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches100 first 100 (".(((100-$mismatches100)/100)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
}
else if($alignLen>=50)
{
if($mismatches50<=1 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
$stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
}
else
{
if($mismatches==0) $verdict="PASS"; else $verdict.="FAIL";
$stats.="<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
}
$alignment.=$stats;
$alignment.="<br>query: ".$fqSeq;
$alignment.="<br>target:".$ftSeq;
#fclose($fa);
if(strlen($alignment)<8000)
$alignHash['alignment']=$alignment;
else
$alignHash['alignment']='too long to display';
$alignHash['verdict']=$verdict;
return($alignHash);
}
Awesome, thank you!
I have not used blast for a long time, so I could be wrong. But as I remember, identity=(alen-mis-gaps)/alen. This actually also makes more sense to me.
Be aware that this formula assumes that all non-aligned portions of the read are mismatches, which is not necessarily true if you were actually doing the complete alignment. The FASTA package has a global-local aligner that fources one of the two sequences (the read in this case) to be entirely aligned against a part of another sequence. The PID of such a glocal alignment might be higher than the one of this formula.
I'm aligning sequencing reads to reference genomes so the equation seems to be suitable for the job.