Biostar Beta. Not for public use.
Question: How to interpret this duplicate sequence plot from FastQC

Hello everyone!

I have read the help documentation from the FastQC group, but it does not go into enough detail.

Here is my understanding of this duplicate sequence plot from FastQC:

The title says "Percent of seqs remaining if deduplicated 14.11%". Does this mean that if I deduplicate my data, only 14.11% of the sequences will remain, i.e. the duplication level is very high?

From the red line, can I say that about 60% of the deduplicated sequences are at duplication level "1" and about 25% of the deduplicated sequences are at duplication level ">10"?

From the blue line, can I say that about 10% of the total sequences are at duplication level "1" and about 65% of the total sequences are at duplication level ">10"?

Is this interpretation right?

Can I say, based on this plot, that the libraries may contain technical duplication? What other analysis should I do to rule this out?

Background can be found here

Thank you very much in advance!

SMILE (100), 2.4 years ago; updated 2.4 years ago by Kevin Blighe (43k)

This question already has a very good answer in this thread by the main moderator: Revisiting the FastQC read duplication report

Kevin

Kevin Blighe (43k), 2.4 years ago

Thank you Kevin!

I have read that answer and the updated interpretation for the new version of the FastQC duplicate sequence plot. My data looks most like Example 3, but it is different: their duplication levels are mostly in the thousands, whereas in my case most are around 10. So is my data better? What are the implications of different duplication levels? If duplication levels in the thousands can indicate a technical error in sequencing, what about levels of around 10? Some people suggest not trusting the duplicate sequence plot too much and also considering the per base quality plot to get a realistic assessment of the duplication. In my case the per base sequence quality is great, but a high proportion of my reads are at duplication levels of around 10. What does this imply?

SMILE (100), 2.4 years ago

Hey,

I think that your plot is more like Example 2, but it is just that you have a greater magnitude of duplication.

Your plot indicates that about 65% of your reads are duplicated between 10 and 50 times. The spike may be there purely because that single bin pools roughly 40 different levels of duplication (10x, 11x, 12x, 13x, ... up to about 50x). Did you run the sample through more than one PCR amplification step?
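To make the two curves and the title percentage concrete, here is a minimal Python sketch of how such a plot could be computed from a list of read sequences. It is an illustration only, not FastQC's actual code: the function names are made up, the bin edges (for example, levels 10 to 49 falling into ">10") are an assumption, and FastQC itself works from a subsample of the file, so its numbers are estimates.

```python
from collections import Counter

# Pooled bins for high duplication levels. The exact edges are an
# assumption of this sketch, not taken from the FastQC source.
POOLED_BINS = [(10, ">10"), (50, ">50"), (100, ">100"), (500, ">500"),
               (1000, ">1k"), (5000, ">5k"), (10000, ">10k")]


def bin_label(level):
    """Map a duplication level (copies of one sequence) to a bin label."""
    if level < 10:
        return str(level)          # levels 1..9 each get their own bin
    label = ">10"
    for edge, name in POOLED_BINS:
        if level >= edge:
            label = name           # keep the highest edge not above level
    return label


def duplication_profile(sequences):
    """Return (percent remaining if deduplicated, red line, blue line)."""
    counts = Counter(sequences)    # sequence -> number of copies
    total = sum(counts.values())   # all reads
    distinct = len(counts)         # reads left after deduplication

    distinct_in_bin = Counter()
    reads_in_bin = Counter()
    for copies in counts.values():
        label = bin_label(copies)
        distinct_in_bin[label] += 1      # each distinct sequence counts once
        reads_in_bin[label] += copies    # every copy counts

    remaining = 100.0 * distinct / total
    red = {b: 100.0 * n / distinct for b, n in distinct_in_bin.items()}
    blue = {b: 100.0 * n / total for b, n in reads_in_bin.items()}
    return remaining, red, blue


# Toy data: 6 sequences seen once, 2 sequences seen 12 times each.
reads = [f"singleton_{i}" for i in range(6)] + ["dupA"] * 12 + ["dupB"] * 12
remaining, red, blue = duplication_profile(reads)
print(f"Percent remaining if deduplicated: {remaining:.2f}%")
print("Red line (% of deduplicated sequences):", red)
print("Blue line (% of total sequences):", blue)
```

In the toy data, only 2 of the 8 distinct sequences are duplicated, yet they account for 24 of the 30 reads. That is why the blue line can peak at ">10" while the red line peaks at "1".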

This level of duplication may not be ideal, but I don't believe that it will cause a major problem for you in downstream analyses. As you mentioned, there is disagreement in the sequencing field about how important it is to remove these duplicate sequences. The best thing to do is to run the analysis twice, once with the duplicates removed and once without, and see what differences you get.
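As a rough illustration of the "with and without duplicates" comparison, the sketch below produces a deduplicated copy of a set of FASTQ records by keeping the first record for each exact read sequence. The function name is hypothetical, and exact-sequence matching is a simplifying assumption: for aligned data, duplicates are usually marked by alignment position with tools such as Picard MarkDuplicates, which this sketch does not replicate.

```python
def dedup_fastq_records(lines):
    """Keep the first FASTQ record seen for each distinct read sequence.

    `lines` is an iterable of text lines in strict 4-line FASTQ records
    (header, sequence, '+', qualities). Returns a list of 4-tuples.
    """
    seen = set()
    kept = []
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            sequence = record[1]
            if sequence not in seen:   # first time this sequence appears
                seen.add(sequence)
                kept.append(tuple(record))
            record = []
    return kept


# Toy input: three records, two of which share the same sequence.
toy_fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTACGT", "+", "HHHHHHHH",
    "@read3", "TTTTGGGG", "+", "IIIIIIII",
]
kept = dedup_fastq_records(toy_fastq)
print(f"{len(kept)} of {len(toy_fastq) // 4} records kept")
```

Running the downstream analysis once on the original records and once on the deduplicated ones then shows how much the duplicates affect the result.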

Kevin Blighe (43k), 2.4 years ago
