FASTQ file: duplicated sequences and overall poor quality
5.6 years ago
piyushjo ▴ 700

Hi,

I am running FastQC on a set of FASTQ files I downloaded from SRA. The report suggests overall poor quality: "Per base sequence content" and "Sequence duplication levels" are both flagged red. There is no adapter content, but a lot of sequences are listed under "Overrepresented sequences" (less than 1%). So I ran trim_galore with default parameters and the paired option. The post-processing report looks worse than before, with no improvement in sequence duplication levels or overrepresented sequences.

Since no adapter content is flagged, I can't run Trimmomatic with an adapter sequence. Could you tell me what processing I need to do to improve the sequence quality? For the particular example I posted, the alignment rate to the reference genome is 88% (paired reads). I also have some single-cell data from the same experiments with 60-70% alignment.

Original: [image: FastQC report]

Overrepresented sequences (original): [image]

Post processing: [image: FastQC report]

Overrepresented sequences (post processing): [image]

Please use these instructions to add images properly: How to add images to a Biostars post

I am trying to upload the pictures, but they don't show up in the post.

I showed you one example above.

Thanks! I was copying the wrong link.

Seeing an "X" does not automatically mean the data are bad. You have to interpret the results in the context of the experiment you are looking at. Please take some time to read the informative blog posts that the FastQC team has on this site.

BTW: Your data looks ok (at least the bit you posted).

I added the overrepresented sequences part. I also want to mention that trim_galore detected the Nextera transposase sequence, which FastQC doesn't show, and clipped it. I think that resulted in variable sequence length.
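One way to confirm that the clipping is what produced the variable lengths is to tabulate read lengths before and after trimming. The following is only a minimal sketch: it assumes a plain, uncompressed FASTQ with exactly four lines per record, and the function name `read_lengths` and the tiny in-memory example are made up for illustration.

```python
from collections import Counter
import io

def read_lengths(handle):
    """Count read lengths in a 4-line-per-record FASTQ stream (assumed layout)."""
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            lengths[len(line.strip())] += 1
    return lengths

# Made-up example data standing in for a trimmed FASTQ file.
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTA\n+\nIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
lengths = read_lengths(example)
for length, count in sorted(lengths.items()):
    print(length, count)
```

For a real file, the same function could be called on `open("reads.fastq")`, where the file name is a placeholder. A single length before trimming and a spread of shorter lengths afterwards would be consistent with clipping.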

"I think that resulted in variable sequence length."

That is expected. When extraneous sequences are trimmed that will happen.

Over-represented sequences could be sequences that were enriched as part of the experiment (e.g. a binding site). So even if FastQC flagged them, they may represent a result you want. I suggest that you move on to the next step (as long as all extraneous sequence has been trimmed from the data).
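Before deciding what to do with the flagged sequences, it can help to list them with their counts, so they can be checked against the experiment (for example with a BLAST search). Below is a rough sketch, not a reimplementation of FastQC: it counts exact duplicate reads in an uncompressed four-line-per-record FASTQ and reports those above a fraction of the total. The function name and example data are hypothetical. The default threshold of 0.1% mirrors the level at which FastQC starts listing a sequence, but FastQC applies further rules of its own that this sketch ignores.

```python
from collections import Counter
import io

def overrepresented(handle, min_fraction=0.001):
    """Return (sequence, count, fraction) for reads above min_fraction of the total."""
    counts = Counter()
    total = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence line of each FASTQ record
            counts[line.strip()] += 1
            total += 1
    # most_common() yields sequences from most to least frequent
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]

# Made-up example: one sequence makes up 3 of 4 reads.
example = io.StringIO(
    "@r1\nAAAA\n+\nIIII\n"
    "@r2\nAAAA\n+\nIIII\n"
    "@r3\nCCCC\n+\nIIII\n"
    "@r4\nAAAA\n+\nIIII\n"
)
hits = overrepresented(example, min_fraction=0.5)
for seq, n, frac in hits:
    print(seq, n, f"{frac:.1%}")
```

Holding every distinct read in memory is fine for a quick look at a subsample, but may be too much for a full sequencing run.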

Hi Genomax,

After I talked to another bioinformatics professor, he recommended removing the overrepresented sequences because the source is tissue and not amplified RNA. Could you suggest a tool that can find and remove overrepresented sequences?

Thanks!

You could use bbduk.sh from the BBMap suite with the literal=sequence1,sequence2 (etc.) option. That said, I don't think it is a good idea, since selectively removing sequences could skew your data in some way.
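To make concrete what this kind of filtering does to the data, here is a small sketch of the idea. It is not how BBDuk works: BBDuk matches by k-mers, whereas this sketch uses a cruder exact substring match on either strand. The function names and the example reads are made up, and the input is assumed to be an uncompressed four-line-per-record FASTQ.

```python
import io

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def drop_reads_containing(handle, unwanted):
    """Yield FASTQ records (4-tuples) whose read contains none of the
    unwanted sequences, checking both strands by exact substring match."""
    patterns = set()
    for seq in unwanted:
        patterns.add(seq)
        patterns.add(revcomp(seq))
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        read = record[1].strip()
        if not any(p in read for p in patterns):
            yield tuple(line.rstrip("\n") for line in record)

# Made-up example: r1 contains the unwanted sequence, r3 its reverse complement.
example = io.StringIO(
    "@r1\nTTAAGGCCTT\n+\nIIIIIIIIII\n"
    "@r2\nACACACACAC\n+\nIIIIIIIIII\n"
    "@r3\nCCGGCCTTCC\n+\nIIIIIIIIII\n"
)
kept = list(drop_reads_containing(example, ["AAGGCC"]))
for record in kept:
    print("\n".join(record))
```

Note that the sketch handles a single file only. With paired-end data, both mates of a pair would have to be dropped together to keep the two files in sync. Counting how many reads are removed is also a quick way to see how much this step could skew the data.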
