normalize different size fastq files
3.1 years ago
serene.s

Hi, I have fastq files of different sizes, e.g. one is 3 GB and another is 18 GB. For the analysis I want to randomly select reads from the larger files to bring them down to about 5 GB. Is there a program or script that can do this?

Thank you,
Saraswati

metagenome data genomics normalization

What type of data is this, and what is the ultimate goal of the analysis?


This is shotgun metagenome data, and my goal is taxonomic classification: align the reads with DIAMOND, then import the output into MEGAN6 Community Edition for classification. But the larger fastq files take a very long time to produce DIAMOND output, and the output files themselves are gigabytes in size, which is difficult for MEGAN to handle; it hangs my system.

Thanks for responding :-)

3.1 years ago
Istvan Albert

There is probably no tool that selects by file size directly, but there are many options for selecting by read number or fraction. Testing a few values will let you home in on the one that gives roughly the size you want (Google either of the tools seqtk or seqkit for more details).

seqtk sample

prints:

Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>

Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
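
For example, to get roughly 5 GB out of an 18 GB file you could sample a fraction of about 5/18 ≈ 0.28. A minimal sketch (file names here are placeholders):

# keep ~28% of the reads; a fixed -s seed makes the run reproducible
seqtk sample -s100 big.fastq.gz 0.28 > subsample.fastq

For paired-end data, run the same command on both mate files with the same -s seed so the pairs stay in sync.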

or using seqkit:

seqkit sample -h

prints:

sample sequences by number or proportion.

Usage:
  seqkit sample [flags]

Flags:
  -h, --help               help for sample
  -n, --number int         sample by number (result may not exactly match)
  -p, --proportion float   sample by proportion
  -s, --rand-seed int      rand seed (default 11)
  -2, --two-pass           2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
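
A minimal sketch of the same approach with seqkit, sampling by proportion (file names are placeholders):

# keep ~28% of the reads; -s fixes the seed so the sample is reproducible
seqkit sample -p 0.28 -s 11 big.fastq.gz -o subsample.fastq.gz

Sampling by proportion streams the file once; sampling by an exact number (-n) holds reads in memory unless you use the two-pass mode (-2), as the help text above notes.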

Hi Istvan Albert, thanks for your response. I have one more question, slightly off topic: do you have any idea what e-value to use when running DIAMOND blastx against the nr database? The default is 1e-3. I want to find hits for all of my metagenome reads. How would it affect my results if I set the e-value to 1e-20 to get results faster?

Thanks,
Saraswati

3.1 years ago
GenoMax

Note that you are mixing up two different operations: down-sampling and normalization are not the same thing.

You can use reformat.sh from the BBMap suite to simply down-sample the larger file. The following options are relevant:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
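
For instance, a sketch that down-samples a paired-end library to a fixed number of read pairs (file names and the target count are placeholders; samplebasestarget would let you aim for a base total instead, which maps more directly to file size):

# emit ~10 million read pairs; a positive sampleseed makes the sampling deterministic
reformat.sh in1=big_R1.fastq.gz in2=big_R2.fastq.gz \
    out1=sub_R1.fastq.gz out2=sub_R2.fastq.gz \
    samplereadstarget=10000000 sampleseed=7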

Depending on the aim of the analysis, it may be more appropriate to normalize the data using bbnorm.sh. A guide is available.
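
For completeness, a minimal bbnorm.sh sketch using the parameter values shown in the BBNorm guide (file names are placeholders):

# normalize to ~100x target depth, discarding reads with apparent depth below 5x
bbnorm.sh in=reads.fastq.gz out=normalized.fastq.gz target=100 min=5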
