Divergent number of replicates between FastQC and samtools
3
1
Entering edit mode
4.9 years ago
user31888 ▴ 130

I have a total of 300 million reads in my R1 and R2 raw fastq files.

How come fastQC count about 50% duplicates in each .fastq files, but after alignment with STAR (default parameters), none of the reads have the 1024 flag (PCR or optical duplicates) when doing samtools view -c -f 1024 mydata.bam?

fastqc STAR samtools • 2.1k views
ADD COMMENT
3
Entering edit mode
4.9 years ago
h.mon 35k

Because STAR with default settings doesn't mark duplicates, you have to use --bamRemoveDuplicatesType UniqueIdentical to have duplicates marked (or --bamRemoveDuplicatesType UniqueIdenticalNotMulti).

edit: also, to elaborate on the point genomax makes:

STAR (if you choose to mark duplicates with the correct parameter, which apparently you didn't) and FastQC use very different methods of detecting duplicates, so it shouldn't be surprising if the number os estimated duplicates is different. The documentation for FastQC states:

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed

And also:

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis.

Whereas STAR uses mapping position:

UniqueIdentical

mark all multimappers, and duplicate unique mappers.

The coordinates, FLAG, CIGAR must be identical

ADD COMMENT
1
Entering edit mode

Thanks for your very complete answer ! I actually did use --bamRemoveDuplicatesType UniqueIdentical, but my version of STAR (2.5.1b) does not tag duplicates in the same time as aligning fastq reads. I needed to (1) align the .fastq reads; (2) tag duplicates using star --runMode inputAlignmentsFromBAM --inputBAMfile input.bam --bamRemoveDuplicatesType UniqueIdentical.

ADD REPLY
0
Entering edit mode

You are correct, and current version (2.7.1a) works the same as the version you are using (2.5.1b). The pdf manual is incomplete and doesn't explain this, but the command-line help does:

### BAM processing
bamRemoveDuplicatesType  -
    string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only
                        -                       ... no duplicate removal/marking
                        UniqueIdentical         ... mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
                        UniqueIdenticalNotMulti  ... mark duplicate unique mappers but not multimappers.
ADD REPLY
3
Entering edit mode
4.9 years ago

In addition to the points made by @h.mon and @genomax, another point is that if I remember correctly, FastQC only gives you the single-ended duplication rate - that is from a given fastq file, how many reads have exactly the same sequence.

However, this is not how tools that function on alignments (like STAR's --bamRemoveDuplicatesType UniqueIdentical, or Picard's MarkDuplicates or samtools's markdup) function. They look at a pair of reads. If they have the same outer alignment co-ordinates, they are marked as duplicates. Thus in the following pairs:

    R1                    R2
---------->          <----------

    R3                           R4
---------->                  <----------

          R5               R6
      --------->      <---------     

          R7                        R8
      --------->             <----------

fastqc would mark everything as a duplicate because (R1=R3, R2=R6, R5=R7 and R4=R8). This assumes no sequencing errors. samtools/Picard/STAR would mark nothing as a duplicate because no pair has identical coordinates at both ends.

ADD COMMENT
0
Entering edit mode

Thanks for your explanation. It's a bit out of the topic here, but I am just curious. According to you, if one have single-end reads, fastqc and samtools / picard / STAR should output the same number of duplicates then?

ADD REPLY
0
Entering edit mode

Well almost. FastQC tags on having the same sequence. samtools/picard/STAR tags on having the same alignment coordinates. The two might not be equivalent in cases where you have a sequencing error in a read that leads to it having a different sequence, but still aligning to the same location.

ADD REPLY
1
Entering edit mode
4.9 years ago
GenoMax 141k

FastQC does not look at your entire dataset for some of the parameters it is testing. That would be memory/time consuming. It samples your data as it goes along.

So that 50% number is an estimate. It may vary depending how reads are distributed in your data.

ADD COMMENT
0
Entering edit mode

Correct. After counting them all "manually", I found a different number as well.

ADD REPLY

Login before adding your answer.

Traffic: 1745 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6