Hi :)
I have assembled paired-end reads (illumina strand-specific) using trinity with some different settings, and then merged these assemblies and done some filtering. I now want to assemble the un-mapped reads (pairs that do not map concordantly), and I therefore mapped the reads to the assembly using bowtie2/2.2.9.
This was the command:
(bowtie2 -p 16 --un-conc-gz all_unmapped_R%.fq.gz --no-unal -k 20 -x merged3_may2018_corset.fasta -q -1 all_R1.fq -2 all_R2.fq | samtools view -@16 -Sb -o merged3_may2018_corset_bowtie2.bam) 3>&1 1>&2 2>&3 | tee merged3_may2018_corset.align_stats
Which gave the following statistics:
954299926 reads; of these:
954299926 (100.00%) were paired; of these:
**211582978** (22.17%) aligned concordantly 0 times
185228876 (19.41%) aligned concordantly exactly 1 time
557488072 (58.42%) aligned concordantly >1 times
----
211582978 pairs aligned concordantly 0 times; of these:
11851194 (5.60%) aligned discordantly 1 time
----
199731784 pairs aligned 0 times concordantly or discordantly; of these:
399463568 mates make up the pairs; of these:
196662163 (49.23%) aligned 0 times
48139061 (12.05%) aligned exactly 1 time
154662344 (38.72%) aligned >1 times
89.70% overall alignment rate
Based on this, I thought there were 211,582,978 read pairs that did not align concordantly, and that these pairs would be separated into an R1 and an R2 file, thus with 211,582,978 reads in each.
However, when counting the lines in the files and dividing by four, I get that there are 620,381,610 reads in R1 and 620,027,677 in R2...
So not only are there three times more un-mapped reads than I thought, and they have not been output as proper pairs :-/
Obviously, it makes no sense to assemble these un-mapped reads when they constitute 2/3 of all the reads...
Does anyone have any idea why there is this discrepancy? And how do I solve it?
Cheers Sjannie
Apparantly, it is an issue with the -k > 1 and the slightly old version (2.2.9) of bowtie2 that I am using: https://github.com/BenLangmead/bowtie2/issues/37
Are you counting the lines of the gzipped file ? If it is compressed, then the number of lines is not directly related to the number of reads. To make sure you are counting right, you can also use fastqc on your .fq.gz files.
I counted the lines of the unzipped file. When I do the same for the original file, I get the correct number of reads, 954,299,926, for both R1 and R2. So I do not think that the counting method is the problem. Also, the files with all reads are 273Gb, while the files with the un-mapped reads are 177-178Gb, which is what alerted be to the problem in the first place.
But thanks anyway :)
Here is the stats reported by samtools flagstat:
954,299,926 minus 1,485,433,896 / 2 = 211,582,978, which is the number given by the bowtie stats also (it would have been surprising of course if they differede...).
I will try and re-run the bowtie2 command without --no-unal, and then capture unmapped reads from the bam file, using -F 2 to exclude only properly paired alignments. Based on the samtools stats, that really should give the correct number of unmapped reads - right?
Though I still do not understand why it is not the correct number from bowtie2 in the first place...
It would be interesting to compare the number of unmapped reads (-F 4) and the number of not properly aligned pairs (-F 2).