I'd like to map metagenomes (2 x 150 bp reads) against selected genes (500 - 3000 bp in length) using Bowtie2. I expect it might happen often then that reads only map partially as they "go beyond" the reference gene as illustrated in the example below.
Read: CTCGACTGGGGGATCTCGATTTCGTAGACGGCTTGCGGTT...
||||||| |||||||| |||||||
Reference Gene: ...CGTTGACGTTAGAAGGGAAGGACTGGGCGATCTCGACTTCGTAG (end of gene)
I'm wondering if bases of the read that are beyond the end of the gene are treated as mismatches, i.e. penalties are added to the alignment score for these?
My desired behavior would be that the read is treated as if it would be only as long as the overlapping region (27 bases in the above example). Bases on the right end of the read in the example shall be clipped, but not the three bases on the left end of the read.
Does anyone now if this can be set in Bowtie2, or possibly if this is even the default behavior with --end-to-end
mode?
I know that there is the --local
mode, but this is not exactly what I want as it may clip also the left three bases of the read. Maybe there are settings I could use in combination with --local
to achieve the desired behavior?
Part of the read that is not mapped would be soft-clipped. Sounds like that is what you want?
I think my question was unclear so I amended it a bit. So yes, I'd like the read to be clipped. However, referring to the above example, only it's right end but not it's left end shall be clipped.
Left-end should be soft-clipped as well. Have you looked at the CIGAR strings in your alignment file?
Actually I haven't run the mapping yet. I'd just like to find out which settings are reasonable for mapping against single genes, and start the mapping then.
How many genes are you going to map against? Is this a targeted/capture experiment? You always run the risk of forcing reads to align in places where they should not if you used a reduced reference and the original experiment was done with full genome.
I'm planning to use about 100 single copy genes from a reference genome (bacterial isolate) and map different environmental metagenomes against them. If I use stringent mapping conditions (or eventually filter mapped reads afterwards), I don't expect reads that would e.g. align better to a different region of the genome to map to the reference gene.
Proof is going to be in the pudding so to speak. Let us know how you fare on the alignments. Results would be hard to predict since we have no idea what is in your samples and what fraction of the data will align to the genes you start with. Do keep track of multi-mapping reads just in case.
Sure. My question was not really about predicting results but rather how Bowtie2 scores an alignment as shown above.
If you are looking for specifics on how that particular alignment would be scored (in terms of MAPQ) then you may need to dive into code. You will want to read this blog post about mapping qualities.
I think what you are thinking of is a
score
like blast+ alignment. That concept is not directly transferable to NGS alignments.I was not referring to the mapping quality but the alignment score (AS tag in the sam file). I really should have been more specific in my question, I changed it a bit further. To calculate the alignment score, Bowtie2 (default) uses e.g. the following penalties:
mismatch: 2 - 6, gap open: 5, gap extend: 3
The minimum alignment score needed for an alignment to be considered “valid” can be set using
--score-min
. Now my question is if Bowtie2 scores all the bases at the right end of the read in my example as mismatches? If so, the alignment will never be considered valid with a reasonable score-min. If I use--local
mode these bases will be clipped, but then I also loose the penalties of the three left-end bases, which I'd like to avoid. Anyway, I’ll probably just try around with a small set of made-up sequences to see what happens.