Off topic:How To Interpret/Reconcile The Counts Obtained Via Samtools Flagstat
2
4
Entering edit mode
11.2 years ago
bow ▴ 790

I'm trying to understand a BAM file (how many reads are mapped, etc.) made from aligning a partial RNA seq library. Partial here means I only took the first 100,000 FASTQ records from the total paired end-reads (so 2 file of 100,000 reads each). Of these 200,000 I did some filtering that trims down the records to 174,904, then fed them into the mapper (I used GSNAP) to obtain the BAM file, and then sorts it using samtools sort. Before mapping, I've made sure all read pairs are intact (no singletons were present).

Now, when doing a samtools flagstat on the sorted BAM file, here's the output (with my comments afterwards):

200794 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
196324 + 0 mapped (97.77%:-nan%)  # call this mapped
200794 + 0 paired in sequencing # call this paired
100187 + 0 read1 # this is read1
100607 + 0 read2 # this is read2
193396 + 0 properly paired (96.32%:-nan%) # call this properly_paired
194097 + 0 with itself and mate mapped # this is mate_mapped
2227 + 0 singletons (1.11%:-nan%) # singletons
500 + 0 with mate mapped to a different chr #and this is mate_weird
328 + 0 with mate mapped to a different chr (mapQ>=5)

I then tried several samtools view command with the -c and -f flags to see if I understand the outputs correct, e.g.:

$ samtools view -c -f 0x2 file.bam # selects for properly paired reads
193396 (checks out with flagstat)

$ samtools view -c -f 0x40 file.bam # selects for first read
100187 (checks out with flagstat read1)

$ samtools view -c -f 0x80 file.bam # selects for last read
100607 (checks out with flagstat read2)

$ samtools view -c -f 0x42 file.bam # selects for properly paired first read
96702

$ samtools view -c -f 0x82 file.bam # selects for properly paired last read
96694

Now, my questions are:

  1. Of the initial 174,904 FASTQ records fed into the mapper, why does flagstat show 200,794 reads? I thought perhaps this is because there may be split reads (reads mapped to different exons), but I'm not sure..

  2. How come there are different numbers of mapped of read1 and read2 both before and after I select for properly paired reads? Aren't they supposed to be the same?

  3. I did some counting just to figure out if the numbers check out, but I think I may have missed some. In particular, I don't understand these:

  4. total - mate_mapped is 6697, while singletons is 2227. Why the discrepancy (4470) and what are those sequences?

  5. mate_mapped - mate_weird is 193,597, while properly_paired is 193,396 (201 reads discrepancy). Why the discrepancy and what are those sequences?

Thanks in advance for the help :)!

samtools bam • 6.6k views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 2733 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6