Hi,
I am running FastQC quality checks on a bunch of FASTQ files I downloaded from SRA. The FastQC report shows overall poor quality, with per base sequence content and sequence duplication levels flagged red. There is no adapter content, but a lot of sequences are listed in the overrepresented sequences category (less than 1%). So I ran trim_galore with default parameters and the paired option. The post-processing report looks worse than before, with no improvement in sequence duplication levels or overrepresented sequences.
Since no adapter content is flagged now, I can't run Trimmomatic with an adapter sequence. Could you tell me what processing I need to do to improve sequence quality? For the particular example I posted, the alignment percentage to the reference genome is 88% (paired sequences). I also have some single-cell sequences from the same experiments which have 60-70% alignment.
Original
Overrepresented sequences
Post processing
Overrepresented sequences
Please use these instructions to add images properly: How to add images to a Biostars post
I am trying to upload pics, but they don't get posted.
I showed you one example above.
Thanks! I was copying the wrong link.
Seeing an "X" does not automatically mean bad data. You have to take the results in the context of the experiment you are looking at. Please take some time to read the informative blog posts that the FastQC team has on this site.
BTW: Your data looks ok (at least the bit you posted).
I added the overrepresented sequences part. I also want to mention that trim_galore detected the Nextera transposase sequence, which FastQC doesn't show, and then did the clipping. I think that resulted in the variable sequence length.
That is expected. When extraneous sequences are trimmed that will happen.
Over-represented sequences could represent sequences that were enriched as a part of the experiment (e.g. a binding site). So even if FastQC flagged them they may represent a result you want. I suggest that you go along to the next step (as long as all extraneous sequence has been trimmed from the data).
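Before deciding what to do with over-represented sequences, it can help to see what they are and how large a share of the reads each one takes. The following is only a minimal sketch in plain Python, not what FastQC does internally: it counts exact, full-length read sequences, and the helper names `read_fastq_sequences` and `top_sequences` are made up for illustration. It assumes an uncompressed FASTQ with four lines per record.

```python
from collections import Counter
import io

def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record is the sequence
            yield line.strip()

def top_sequences(handle, n=5):
    """Return (sequence, count, percent_of_reads) for the n most common reads."""
    counts = Counter(read_fastq_sequences(handle))
    total = sum(counts.values())
    return [(seq, c, 100.0 * c / total) for seq, c in counts.most_common(n)]

# Tiny in-memory example; for real data pass an open file handle instead.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
    "@r4\nGGCC\n+\nIIII\n"
)
for seq, count, pct in top_sequences(example, n=2):
    print(f"{seq}\t{count}\t{pct:.1f}%")
```

The top hits can then be searched (for example with BLAST) to judge whether they are contaminants or sequences the experiment is expected to enrich.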
Hi Genomax,
I talked to another bioinfo prof, and he recommended removing the overrepresented sequences, as the source is tissue and not amplified RNA. Could you suggest any tool that can find and remove overrepresented sequences?
Thanks!
You could use bbduk.sh with the literal=sequence1,sequence2 etc. option from the BBMap suite. That said, I don't think that is a good idea, since you could be skewing your data in some way by selectively removing sequences from it.
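To make the idea of removing reads by literal sequence concrete, here is a rough plain-Python sketch. It is not bbduk.sh and does not reproduce its matching: it uses exact substring matching on the forward strand only, and the function names `fastq_records` and `filter_pairs` are hypothetical. It assumes uncompressed paired FASTQ files with mates in the same order, and it drops a pair if either mate contains one of the sequences so the two files stay in sync.

```python
import io

def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual

def filter_pairs(r1, r2, out1, out2, literals):
    """Write pairs where neither mate contains a literal; return (kept, removed)."""
    kept = removed = 0
    for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
        if any(lit in rec1[1] or lit in rec2[1] for lit in literals):
            removed += 1
            continue
        out1.write("\n".join(rec1) + "\n")
        out2.write("\n".join(rec2) + "\n")
        kept += 1
    return kept, removed

# Tiny in-memory example; the literal below is a placeholder, not a real contaminant.
r1 = io.StringIO("@p1/1\nACGTACGT\n+\nIIIIIIII\n@p2/1\nTTTTGGGG\n+\nIIIIIIII\n")
r2 = io.StringIO("@p1/2\nCCCCAAAA\n+\nIIIIIIII\n@p2/2\nGATTACAG\n+\nIIIIIIII\n")
out1, out2 = io.StringIO(), io.StringIO()
kept, removed = filter_pairs(r1, r2, out1, out2, ["GATTACA"])
print(kept, removed)  # prints "1 1"
```

Reporting how many pairs were removed makes it easier to judge how much the filtering could skew the data.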