I am trying to identify reads in a BAM file that deviate from the reference, be it SNP or indel, based on a list of targeted SNPs/indels I have.
Using BEDtools intersect or samtools view -L, I can identify reads that overlap these SNP/indel regions, but I haven't found how to check if the nucleotide matches the reference or matches the SNPs/indels in my list.
My SNPs and indels are tracks in CLC Bio and in a spreadsheet, so these can be formatted into whatever is needed pretty readily - .bed, .vcf, .gff, etc.
For example:
I have a SNP at position 82432 in my reference genome. The reference genome lists a 'T' here, but I want to know if any given read shows a 'C' in its place. How would I do that?
Something like
Right?
This is looking good, thanks!
You probably want to generate a VCF. Something like:
(from the example workflow: http://www.htslib.org/workflow/#mapping_to_variant )
This is great for identifying where SNPs are and what the nucleotide is at that position, but only if I survey it manually. Is there a tool that I can feed a file containing those positions and nucleotides of interest? Not every SNP is important to me and often, at SNP positions, only one alternate nucleotide is of interest. Do you know of a tool for that?
Also, thanks for the help!
You were already doing that when you gave samtools a bed file. You can still do that. I was just showing an example pipeline to get better formatted results.
Parse the big long line, count how many alternate letters are there.