Trimmomatic parameters for WGS data quality improvement
8.9 years ago
ravi.uhdnis ▴ 220

Hello everyone,

I have approx. 30X whole-genome data from five human samples [Illumina HiSeq 2500, PE sequencing] and want to run Trimmomatic just to remove bad/poor-quality bases and reads. According to the FastQC output the data look good, with warnings only in 'Per Base Sequence Content' and 'Per Sequence GC Content' and a failure in 'K-mer Content'. I am not worried about the K-mer content; my aim is to remove the bad/poor reads. For that purpose, I ran a command similar to this one:

# PE mode; -phred33 declares the quality encoding (note: single dash, not --phred33)
# ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 -> adapter file : seed mismatches : palindrome clip threshold : simple clip threshold
# LEADING:3 / TRAILING:3 -> trim leading/trailing bases with quality below 3
# SLIDINGWINDOW:4:15 -> cut once the average quality within a 4-base window falls below 15
# MINLEN:36 -> drop reads shorter than 36 bp after trimming
java -jar trimmomatic-0.30.jar PE \
  -phred33 \
  input_forward.fq.gz input_reverse.fq.gz \
  output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
  output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 \
  TRAILING:3 \
  SLIDINGWINDOW:4:15 \
  MINLEN:36

After running the command, each sample's folder shrank by approximately 10-13 GB, and the quality improved according to the FastQC plots. I want to be sure this is the standard way of using Trimmomatic to improve reads from human WGS data. Down the line, my aim is to find sequence variants and SNPs among these samples and their correlation with human disease progression. In particular, I want to be sure about the ILLUMINACLIP and SLIDINGWINDOW parameters. Please give me your suggestions. Thank you.

genome next-gen-sequencing

What's the number of reads before and after trimming? Similarly, what's the number of bp before and after?
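Both numbers can be pulled straight from the gzipped FASTQ files with a one-liner (a sketch; it assumes standard four-line FASTQ records, and the file name is taken from the command above):

# Every second line of each 4-line FASTQ record is the sequence
zcat input_forward.fq.gz | awk 'NR % 4 == 2 {reads++; bases += length($0)} END {print reads, "reads,", bases, "bases"}'

Running it on the input files and again on the paired output files gives the before/after comparison.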


Hi, I don't have the log file from that run; I'll rerun it and then update with more precise numbers. Thank you.
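For the rerun, the per-read log can be kept with -trimlog, and the summary Trimmomatic prints to stderr can be saved by redirecting it, something like this (a sketch; trim_summary.log is just a placeholder name):

# -trimlog writes one line per read; 2> captures the run summary printed to stderr
java -jar trimmomatic-0.33.jar PE -phred33 -trimlog trimlogfile \
  R1.fastq.gz R2.fastq.gz R1_P.fastq.gz R1_UP.fastq.gz R2_P.fastq.gz R2_UP.fastq.gz \
  ILLUMINACLIP:/usr/local/Trimmomatic-0.33/adapters/TruSeq2-PE.fa:2:30:10 \
  LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 \
  2> trim_summary.log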

TrimmomaticPE: Started with arguments: -phred33 -trimlog trimlogfile R1.fastq.gz R2.fastq.gz R1_P.fastq.gz R1_UP.fastq.gz R2_P.fastq.gz R2_UP.fastq.gz ILLUMINACLIP:/usr/local/Trimmomatic-0.33/adapters/TruSeq2-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15
Multiple cores found: Using 16 threads
Using PrefixPair: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC'
Using Long Clipping Sequence: 'TTTTTTTTTTCAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 6 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 59069357 Both Surviving: 55425870 (93.83%) Forward Only Surviving: 3231511 (5.47%) Reverse Only Surviving: 223807 (0.38%) Dropped: 188169 (0.32%)
TrimmomaticPE: Completed successfully.

Are these optimum parameters for a WGS run of human data with approx. 30X coverage when looking for genetic variants/SNPs etc.? Should I move ahead with the 'Both Surviving' reads only, or are the 'Forward Only Surviving'/'Reverse Only Surviving' reads also of some use? Thank you, Regards, Ravi.


Are these optimum parameters for a WGS run of human data with approx. 30X coverage when looking for genetic variants/SNPs etc.?

You used the default parameters and I think they should be good enough. Frankly speaking, although trimming is important to increase the percentage of mapped reads and to reduce false positives in variant calling, I would not be overly concerned about the exact trimming parameters. You can also control the quality of the bases used for variant calling in samtools mpileup or GATK HaplotypeCaller.
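For example (a sketch with placeholder reference and file names; the thresholds are illustrative, and the GATK call uses GATK4 syntax):

# samtools mpileup: -q sets the minimum mapping quality, -Q the minimum base quality
samtools mpileup -f hg19.fa -q 20 -Q 20 sample.bam > sample.pileup

# GATK HaplotypeCaller: bases below the threshold are ignored during calling
gatk HaplotypeCaller -R hg19.fa -I sample.bam -O sample.vcf.gz --min-base-quality-score 20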

Should I move ahead with the 'Both Surviving' reads only, or are the 'Forward Only Surviving'/'Reverse Only Surviving' reads also of some use?

94% of read pairs had both reads survive. Because you are not losing much by ignoring the singleton reads, I would suggest you go with the 'Both Surviving' reads to keep it simple. FYI, Bowtie2 can take a set of paired reads and single-end reads for alignment in one go; I don't think BWA can take both paired-end and single-end reads in one run. Either way, I would still recommend using the 'Both Surviving' reads.
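To illustrate the difference (a sketch; hg19_index and hg19.fa are placeholder index/reference names, the FASTQ names are from the log above):

# Bowtie2 takes pairs (-1/-2) and singletons (-U) in a single run
bowtie2 -x hg19_index \
  -1 R1_P.fastq.gz -2 R2_P.fastq.gz \
  -U R1_UP.fastq.gz,R2_UP.fastq.gz \
  -S aligned.sam

# BWA-MEM would need separate runs for the paired and the unpaired reads
bwa mem hg19.fa R1_P.fastq.gz R2_P.fastq.gz > paired.sam
bwa mem hg19.fa R1_UP.fastq.gz > unpaired_R1.sam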


Thank you Ashutosh Pandey. Is it normal to lose 10-12 GB of FASTQ data during trimming of a human WGS sample at about 30X coverage? My initial forward and reverse FASTQ files were 38 GB each, but after Trimmomatic they were reduced to 26 GB each (paired output only). Although the 'Both Surviving' rate was ~94%, I don't want to lose important data during trimming; that's why I asked this question, to be sure about the Trimmomatic parameters.


OK, got it. It is not normal to lose one-third of the data. You are losing data at the nucleotide level rather than at the read level, because 95% of the pairs survived: many bases were trimmed from individual reads, but most reads stayed long enough not to be dropped by the MINLEN parameter. As Noolean said, you may need to check the average and median read lengths before and after trimming; you can do this analysis on a subset of the reads. You should also tell us the platform used and the read length.
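Something like this would do for a subset (a sketch; it takes the first one million reads, with file names from the log above):

# Mean read length over the first 1M reads (4M FASTQ lines)
zcat R1_P.fastq.gz | head -n 4000000 | awk 'NR % 4 == 2 {n++; sum += length($0)} END {print "mean:", sum/n}'

# Median read length over the same subset (lower median for even counts)
zcat R1_P.fastq.gz | head -n 4000000 | awk 'NR % 4 == 2 {print length($0)}' | sort -n | awk '{a[NR] = $1} END {print "median:", a[int((NR + 1) / 2)]}'

Comparing the numbers for the raw R1.fastq.gz against the trimmed R1_P.fastq.gz will show how much length the trimming cost.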

8.7 years ago
ravi.uhdnis ▴ 220

Hi Ashutosh, many thanks for your suggestions and responses. It's good to know that I am on the right track and not making any blunder. I have made links to my input files (the FastQC output for the raw reads as well as for the trimmed files, as HTML pages). Please have a look at them (https://www.dropbox.com/sh/dy5hlxwm7wkrlb5/AAB1BiBzyXxPOK_6OyQd2uKba?dl=0). I only have these .html pages with information about read number and length, before and after trimming. If there is a better way to do this, please suggest it. Thanks again.


Hi Ravi, I quickly looked at those FastQC files and they look good to me. They don't have the average and median lengths I was looking for, but I get the feeling that the reverse FASTQ files had a lot of reads with poor-quality ends (Per Base Sequence Quality shows a drop in quality towards the 3' end, and Per Sequence Quality Scores has a big tail towards the low scores). I am not 100 percent sure, but this could be the reason. Just move ahead with the analysis, or you can adjust the sliding window that you are using to trim your data, changing it from 4:15 to 4:10 or 8:15.
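That is only a one-token change to your earlier command, e.g. (hypothetical rerun; everything except SLIDINGWINDOW matches the invocation in your log):

# 4:10 keeps the 4-base window but lowers the average-quality cutoff from 15 to 10
java -jar trimmomatic-0.33.jar PE -phred33 \
  R1.fastq.gz R2.fastq.gz R1_P.fastq.gz R1_UP.fastq.gz R2_P.fastq.gz R2_UP.fastq.gz \
  ILLUMINACLIP:/usr/local/Trimmomatic-0.33/adapters/TruSeq2-PE.fa:2:30:10 \
  LEADING:5 TRAILING:5 SLIDINGWINDOW:4:10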


Thanks for the comment, Ashutosh. I don't know which software/tool provides the average/median read length; I simply ran FastQC for the quality check. In case you know of something, please let me know. Yes, you are right, the reverse reads are poor, particularly at the 3' end. I am running my next analysis steps with a modified window length at the trimming step to see whether it has any major effect. Thank you very much, Ravi.
