Question

Solutions to FastQC diagnosis of Micro-Satellites

0

6.4 years ago

dec986 ▴ 370

Hello,

I'm looking through some FastQC reports on public data I've downloaded. The reports show mediocre quality, and the trimmers aren't making a difference.

I've discussed this with a colleague, who says the problem is that many reads show overrepresented sequence in the middle of the read. Because trimmers only trim at the beginning and end of reads, neither the trimmers nor the other QC tools will make a difference. My colleague used the phrase "micro-satellites" for this pattern, which shows up around nucleotides 105-109 as a sharp drop and then rise in %A. I think this will have a very negative effect on the alignment process.

Are there any tools to correct such micro-satellites here? Is this the correct phrase to describe this error? Should I even worry about this?

[Figure: per_base_sequence_content]
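To make the pattern described above concrete, here is a minimal Python sketch of a per-position base percentage, similar in spirit to the FastQC per-base sequence content plot. The function name `percent_base_by_position` and the toy reads are made up for illustration; this is not how FastQC itself is implemented.

```python
from collections import Counter


def percent_base_by_position(reads, base="A"):
    """Return the percentage of `base` at each read position (0-based)."""
    if not reads:
        return []
    max_len = max(len(read) for read in reads)
    counts = [Counter() for _ in range(max_len)]
    for read in reads:
        for i, nt in enumerate(read.upper()):
            counts[i][nt] += 1
    # Each position is normalised by the number of reads that reach it.
    return [100.0 * c[base] / sum(c.values()) for c in counts]


# Toy reads (hypothetical): %A collapses at the middle position only.
toy_reads = ["AAGAA", "AACAA", "AATAA"]
profile = percent_base_by_position(toy_reads)
print(profile)
```

A sharp change confined to a few positions in the middle of the read, as in the toy profile, is the kind of signal worth investigating before blaming the trimmers.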

fastqc RNA-Seq • 1.7k views

ADD COMMENT • link 6.4 years ago by dec986 ▴ 370

3

How did you download this data? Did you split read 1 and read 2? Are you sure these should be 200bp reads?

fastq-dump has an option to split R1+R2, in case this is paired end data.

ADD REPLY • link 6.4 years ago by h.mon 35k

2

I agree, it seems the data are a concatenation of R1+R2. The abnormal cycles in the middle are actually the first cycles of R2.
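As a rough illustration of what "a concatenation of R1+R2" means, the sketch below cuts one record back into two mates. It assumes both mates have the same length (so a 200 bp record would become two 100 bp reads), which is an assumption and not something the files themselves state; `split_concatenated_read` is a hypothetical helper, not part of any SRA tool.

```python
def split_concatenated_read(seq, qual, read_len=None):
    """Split a concatenated R1+R2 record into two (sequence, quality) pairs.

    If `read_len` is not given, the record is cut at its midpoint,
    which assumes both mates were sequenced for the same number of cycles.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if read_len is None:
        if len(seq) % 2:
            raise ValueError("odd length: cannot split at the midpoint")
        read_len = len(seq) // 2
    r1 = (seq[:read_len], qual[:read_len])
    r2 = (seq[read_len:], qual[read_len:])
    return r1, r2


# Toy record (made up): 8 bases standing in for a 200 bp spot.
r1, r2 = split_concatenated_read("ACGTTTGA", "IIIIHHHH")
print(r1, r2)
```

In practice it is safer to let the download tool do the splitting, since it knows the real mate lengths from the run metadata.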

ADD REPLY • link 6.4 years ago by chen ★ 2.5k

0

@chen and @h.mon I have written a program, because the NCBI page https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2532810 doesn't suggest anything about paired-end reads. Also, if I remember how paired-end reads work, there should be some indication of this in the SEQID lines of the file. I don't see it here when I show the first header line of each file:

SRR5335775.fastq.bz2
@SRR5335775.1 GEN-HISEQ:203:C6WRPACXX:2:1101:1496:2241 length=200

SRR5335776.fastq.bz2
@SRR5335776.1 GEN-HISEQ:203:C6WRPACXX:3:1101:1317:2249 length=200

SRR5335777.fastq.bz2
@SRR5335777.1 GEN-HISEQ:203:C6WRPACXX:4:1101:1166:2117 length=200

SRR5335780.fastq.bz2
@SRR5335780.1 GEN-HISEQ:203:C6WRPACXX:2:1101:2228:2142 length=200

SRR5335781.fastq.bz2
@SRR5335781.1 GEN-HISEQ:203:C6WRPACXX:3:1101:1243:2189 length=200

SRR5335782.fastq.bz2
@SRR5335782.1 GEN-HISEQ:203:C6WRPACXX:4:1101:1255:2172 length=200

SRR5335783.fastq.bz2
@SRR5335783.1 GEN-HISEQ:203:C6WRPACXX:5:1101:1250:2127 length=200

SRR5335784.fastq.bz2
@SRR5335784.1 GEN-HISEQ:203:C6WRPACXX:1:1101:1421:2226 length=200
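For reference, a header like the ones above can be pulled apart with a small Python sketch. The regular expression and the function name `parse_sra_defline` are assumptions based only on the lines shown here, not an official description of the format. The point it illustrates is that this style of header carries a run accession, a spot number, the instrument read name, and a length, but no mate marker, so the header alone cannot show whether a spot holds one read or two.

```python
import re

# Pattern inferred from the header lines shown in this thread (an assumption).
DEFLINE = re.compile(
    r"^@(?P<run>[SED]RR\d+)\.(?P<spot>\d+) (?P<name>\S+) length=(?P<length>\d+)$"
)


def parse_sra_defline(line):
    """Parse one header line into run, spot, name, and length fields."""
    match = DEFLINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    fields = match.groupdict()
    fields["spot"] = int(fields["spot"])
    fields["length"] = int(fields["length"])
    return fields


header = "@SRR5335775.1 GEN-HISEQ:203:C6WRPACXX:2:1101:1496:2241 length=200"
info = parse_sra_defline(header)
print(info)
```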

ADD REPLY • link 6.4 years ago by dec986 ▴ 370

1

This is a paired-end data set. If you look at SRA you will see that. I suggest that you avoid GEO/SRA altogether and download the fastq files from ENA.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Hi h.mon,

Did you mean --split-spot or --split-3?

ADD REPLY • link 6.4 years ago by dec986 ▴ 370

1

--split-files should be the option. While you are at it, use -F to recover the original Illumina-style fastq read headers.

That said, see my comment above about getting the fastq files directly from ENA.
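If you do go the fastq-dump route, the two options mentioned above could be combined roughly as in this Python sketch. It only builds and prints the command line; actually running it assumes sra-tools is installed and that fastq-dump can fetch the run by accession. The helper name `build_fastq_dump_command` is made up for illustration.

```python
import shlex


def build_fastq_dump_command(accession, split_files=True, original_headers=True):
    """Build a fastq-dump argument list for one run accession."""
    cmd = ["fastq-dump"]
    if split_files:
        cmd.append("--split-files")  # write R1 and R2 to separate files
    if original_headers:
        cmd.append("-F")  # keep the original read names in the header lines
    cmd.append(accession)
    return cmd


cmd = build_fastq_dump_command("SRR5335775")
print(shlex.join(cmd))
# To execute it for real: subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to loop over the SRR accessions listed earlier in the thread without shell quoting problems.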

ADD REPLY • link 6.4 years ago by GenoMax 141k