I am dealing with breast cancer data from the TCGA Database; and I am trying to get a site frequency spectrum on all of the mutations in the cancerous chromosome-position pairs. For the non-cancerous data I have a VCF file, and for the cancerous data I have a BAM file. So far I have a script that reads the VCF file for non-cancerous data and stores all heterozygous chromosome-position pairs in a dictionary ancestors := {(chrom,pos)} -> (allele) (Which I can easily output into a text file if need be). I am now looking for a way to take the BAM file and get , for every (chromosome, position) in the above dictionary , a list containing the quality of score and allele for every read in the BAM file at the specified chromosome-position pair. To do this I need some command to filter the BAM file.
If I get this working I will then be able to obtain a site frequency spectrum for mutations in the cancerous data. I was trying to use SAMTOOLS and ANGSD but was unable to find the correct command.
Have you considered writing a small program in python using pysam? The mpileup API from that is what you need.
Thanks! I was able to do it with SAMTOOLS. In particular I used the command: mpileup -l pos.BED -b bam.filelist >> results.mpileup The tricky part is that the quality scores are given in ascii code.