I'm trying to understand a BAM file (how many reads are mapped, etc.) made from aligning a partial RNA seq library. Partial here means I only took the first 100,000 FASTQ records from the total paired end-reads (so 2 file of 100,000 reads each). Of these 200,000 I did some filtering that trims down the records to 174,904, then fed them into the mapper (I used GSNAP) to obtain the BAM file, and then sorts it using samtools sort
. Before mapping, I've made sure all read pairs are intact (no singletons were present).
Now, when doing a samtools flagstat
on the sorted BAM file, here's the output (with my comments afterwards):
200794 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
196324 + 0 mapped (97.77%:-nan%) # call this mapped
200794 + 0 paired in sequencing # call this paired
100187 + 0 read1 # this is read1
100607 + 0 read2 # this is read2
193396 + 0 properly paired (96.32%:-nan%) # call this properly_paired
194097 + 0 with itself and mate mapped # this is mate_mapped
2227 + 0 singletons (1.11%:-nan%) # singletons
500 + 0 with mate mapped to a different chr #and this is mate_weird
328 + 0 with mate mapped to a different chr (mapQ>=5)
I then tried several samtools view
command with the -c and -f
flags to see if I understand the outputs correct, e.g.:
$ samtools view -c -f 0x2 file.bam # selects for properly paired reads
193396 (checks out with flagstat)
$ samtools view -c -f 0x40 file.bam # selects for first read
100187 (checks out with flagstat read1)
$ samtools view -c -f 0x80 file.bam # selects for last read
100607 (checks out with flagstat read2)
$ samtools view -c -f 0x42 file.bam # selects for properly paired first read
96702
$ samtools view -c -f 0x82 file.bam # selects for properly paired last read
96694
Now, my questions are:
Of the initial 174,904 FASTQ records fed into the mapper, why does flagstat show 200,794 reads? I thought perhaps this is because there may be split reads (reads mapped to different exons), but I'm not sure..
How come there are different numbers of mapped of
read1
andread2
both before and after I select for properly paired reads? Aren't they supposed to be the same?I did some counting just to figure out if the numbers check out, but I think I may have missed some. In particular, I don't understand these:
total
-mate_mapped
is 6697, whilesingletons
is 2227. Why the discrepancy (4470) and what are those sequences?mate_mapped
-mate_weird
is 193,597, whileproperly_paired
is 193,396 (201 reads discrepancy). Why the discrepancy and what are those sequences?
Thanks in advance for the help :)!