Get average base quality of bases in a read
1
0
Entering edit mode
18 months ago
pablosolar.r ▴ 20

Hi all! Given a BAM, I want to plot the insert size of a read pair vs average base quality of bases in a read I've actually got the insert size data point by filtering my BAM using:

samtools view -f66 mybam.bam  | cut -f9 | awk '{print sqrt($0^2)}' > isizes.txt  # To prevent negatives

But now I have doubts in how to fetch the mean of quality of all bases in both reads of a read pair. Could yo please give me a clue or something? Thank you very much!

Other tools/scripts are also welcomed!

NGS BAM Samtools • 937 views
ADD COMMENT
0
Entering edit mode

See FASTQ Phred33 average base quality score for some clarification on average base quality.

Average base quality does not make a lot of sense in real world.

ADD REPLY
0
Entering edit mode
18 months ago

not fully tested, but it could be:

samtools collate -O -f in.bam TMPDIR |\
samtools view |\
awk -F '\t' 'BEGIN{N="";Q="";for(i=33;i<256;i++) ord[sprintf("%c",i)]=i-33;} {if($1==N) {Q=sprintf("%s%s",Q,$11);V=0.0; for(i=1;i<=length(Q);i++) {V+=ord[substr(Q,i,1)];} printf("%s %f\n",$1,V/length(Q));  N="";Q="";} else {N=$1;Q=$11;}}'
ADD COMMENT
0
Entering edit mode

Hi Pierre, could you please explain that command? Thank you very much!! Is it collate really necessary?

ADD REPLY
0
Entering edit mode

Is it collate really necessary?

Yes. To make sure the read pairs are next to each other in BAM file.

ADD REPLY
0
Entering edit mode

Yes but in that case the expected results have not the same length. I mean, getting the insert sizes:

samtools view -f66 mybam.bam  | cut -f9 | awk '{print sqrt($0^2)}' > isizes.txt  # To prevent negatives

cat isizes.txt | wc -l
> 108853

And running the Pierre command I am getting:

samtools collate -O -f mybam.bam TMPDIR | samtools view |awk -F '\t' 'BEGIN{N="";Q="";for(i=33;i<256;i++) ord[sprintf("%c",i)]=i-33;} {if($1==N) {Q=sprintf("%s%s",Q,$11);V=0.0; for(i=1;i<=length(Q);i++) {V+=ord[substr(Q,i,1)];} printf("%s %f\n",$1,V/length(Q));  N="";Q="";} else {N=$1;Q=$11;}}' | awk '{split($0,a," "); print a[2]}' > average_bq.txt

cat average_bq.txt | wc -l
> 57604

So plot can't be performed due to the different length of data points. I am really missed with this... Thank you all!

ADD REPLY

Login before adding your answer.

Traffic: 2067 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6