Count References Read From Bam File
2
2
Entering edit mode
11.2 years ago
Gvj ▴ 470

I want to know how many reads (above certain score ) supporting References allele at each position in the genome from a BAM file. What I am doing is converting BAM to mpileup format and then parsing it. samtools mpileup -d 10000 -C 1 -E -f file.bam > file.mpileup

Of course mpileup file is too big that it need time to process. Is there any tool to get References allele count (EXCLUDING SNP ALLELE READ) above certain score ?

Thanks,

bam • 6.1k views
ADD COMMENT
2
Entering edit mode
11.2 years ago
Irsan ★ 7.8k

in a samtools pileup file, the fifth column contains the read base at a particular nucleotide position. A dot means that the base in the read is the same as the forward strand, a comma means the read base is the same for the reverse strand. So when you count the occurences of a dot or a comma in the 5th column of the pileup file in each line you have the amount of reference bases in each read:

cut -f5 file.mpileup | sed 's/[\.,]//' | awk '{ print length }' > counts.tsv
ADD COMMENT
0
Entering edit mode

Thanks Irsan. I read about pileup format ( http://samtools.sourceforge.net/pileup.shtml). Something which I couldn't find is the meaning of ">" and "<" symbols in 5th column.

I was looking for a method where I don't have to make mpileup file (Its really big and takes lot of time to make it), otherwise I was doing like

perl -ne '@a=split " ";$l=$a[4]=~tr/[.,]//; print "$a[0] $a[1] $l\n"' file.mpileup > file.mpileup.ref.count

ADD REPLY
0
Entering edit mode

I found the answer for ">" and "<" at http://samtools.sourceforge.net/samtools.shtml it means for a reference skip I am not deleting this comment, so that someone could benefit out of it

ADD REPLY
0
Entering edit mode

You dont have to save the output of samtools mpileup to a file, you can just redirect it to something that counts the amount of reference calls

samtools mpileup [your options] file.bam | cut -f5 | ...
ADD REPLY
0
Entering edit mode
11.2 years ago

The other thing you could try is an all-points vcf. That should output the # of HQ reads at each position in one of the columns.

It's also large, so maybe pipe that into cut, or grep, or awk or something.

ADD COMMENT

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6