Entering edit mode
7.0 years ago
Shicheng Guo
★
9.4k
Hi All,
I finish a perl script to check the phred score for bam files. I just want to make sure there is no bugs in it.
Thanks.
Usage: perl bam2phred.pl input.bam
#!/usr/bin/perl
use strict;
if ($ARGV[0] =~ /^-[h?]/) {
print "Usage: determine-phred FILE
Reads a bam, possibly gzipped and returns the phred-scale,
either 64 or 33, based on a quick scan of the data in the file.
";
exit 0;
}
my $cnt;
my $dphred = 64;
my $input=$ARGV[0];
if ($ARGV[0] =~ /\.bam$/) {
$ARGV[0] = "samtools view '$ARGV[0]'|";
}
my $qual;
my $comm;
my $fmt;
if (@ARGV > 1) {
my @mult = @ARGV;
for my $f (@mult) {
@ARGV = ($f);
determine();
print "$f\t$dphred\n";
}
} else {
determine();
print "$input: $dphred\n";
}
sub determine {
$_ = <>;
# bam
$fmt = 'bam';
$qual = (split(/\s+/, $_))[10];
if (!$qual) {
die "$input: Unknown file format\n";
}
my $ssiz=7000; # sample size
while($qual) {
for (my $i =length($qual)/2; $i < length($qual); ++$i) {
if (ord(substr($qual,$i,1)) < 64) {
$dphred = 33;
$cnt=$ssiz; # last
last;
}
}
$qual = '';
last if ++$cnt >= $ssiz; # got enough
if ($fmt eq 'fq') {
# fastq
scalar <>; # id
scalar <>; # read
scalar <>; # comment
$qual = <>;
chomp $qual;
} else {
# sam
$qual = (split(/\t/, $_))[10];
}
}
}
I am wondering, isn't this step (know the scoring method) necessary before aligning the reads/alignment ? Since some aligners expect us to provide this information.
Yes, the score needs to be known prior to alignment. Also, the sam format specifies that all quality scores be in ASCII-33.
Hi Shicheng Guo, what is the main purpose of this code ? In what kind of situation, this can be used. I assume this can be only used if you want to realign BAM (alignment) file.
I hope to achieve the phred score from bam files which download from public database with this script.
But what is the use of getting the phred score information after alignment has been done (because BAM/SAM will already have phred converted to +33 base quality). Please correct me if I am wrong.
Please check this SAM specification.
The results are not consistent with determine-phred. I'm not sure which one is correct. if there is '?' or numbers, it should be Q33. With this case,
determine-phred
is right. It seems that thedetermine-phred
accepts a revised sam format ( deleting the lines start with @ ). A phred score reference is here and here. Thank you.The code is used on condition that you have gotten bam or sam files after mapping. Just change Q63 (if your fastq is) to Q33 for the bam files. @EagleEye .
Update:
~/local/bam2phred.pl test.bam
will outputtest.bam: 64
samtools view test.bam | gawk 'BEGIN{FS="\t"}!/^@/{split($11, aa, ""); for (i in aa) data[aa[i]]}END{ asorti(data); for (j in data) printf data[j]; print ""}'
will output#&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
, which is Q33.I cannot read perl so far, it seems there should be bugs in your script or the similar one determine-phred.