I have short (20 bp) sequencing reads for DNA that has been in e. coli undergoing different rates of SOS response (inducing mutations in the DNA). When I use bcftools to do variant calling of these reads, mapping them back to the ~10,000 original 20 bp long DNA sequences, I seem to get variants at particular positions but lots of identical variants at the same position. Ie. the mutations seem to all be occurring in the same positions and not distributed in any way.
What am I doing or interpreting wrong? Here is my alignment and variant calling code. In it I align with bowtie, then view and sort with SAM tools, then use bcftools mpileup and bcftools call before filtering and producing a .vcf file.
bowtie2 -p 10 --end-to-end --very-sensitive -3 1 -I 0 -X 200 -x <bowtie2 index file path> -1 <1st paired end read data> -2 <2nd paired read data> | samtools view -bSu - | samtools sort - | bcftools mpileup -d 4000000 -q 0 -Q 20 -Ou -f AM_FullLibrary.fa - --threads 10 | bcftools call -Ou -m -v | bcftools filter -Ov -s LowQual -e '%QUAL<20' > temp.vcf
Opening my VCF file in python using Pandas, I first remove anything that didnt pass the filter and then look at the DP4 information field and sum up the numbers in the 3rd and 4th entries which should correspond to non-reference alignments at this position right?
When I do this I will get that reference DNA example 'a' had a substitution at position 18 from A to T, and the INFO column has:
So here I interpret this as there being that: out of all my sequencing data, there were 15 reads that passed the filter (first entry in DP4) that were reference aligned (no mutations) at this position. And then there were 10 reads that had the substitution at this position?
But for this DNA example 'a' the only variant I am getting is at this position! Surely the mutations for example 'a' should be more uniformly distributed rather than all 10 occurring in the same position?
Any help or insight is most appreciated!