Extracting randomly subset of fastq reads from a huge file
4
1
Entering edit mode
6.7 years ago
KVC_bioinfo ▴ 590

Hello,

I am trying to extract a random subset of paired end fastq reads from a huge 5Gb file. I tried using seqtk and zcat. However, it is taking long time.

The command I used:

zcat huge_file.fastq.gz | sort -R | head -1000 > reads.sample_random.fq

Could someone please help me? Thank you in advance.

random fastq extract • 12k views
ADD COMMENT
1
Entering edit mode

Ummm, won't sort -R completely screw everything up (and be slow and use a LOT of memory)? Define "a long time", you're mostly limited by IO and gzip.

ADD REPLY
0
Entering edit mode

long time: more than 2 hours

ADD REPLY
0
Entering edit mode

I found these commands. However, what is -s (seed)? And is it sampling randomly?

Thank you in advance.

Apply a seed to extract the same reads from two, paired end files:

seqtk sample -s 10 /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq

seqtk sample -s 10 /ebs/ecoli/SRR001666_2.fastq.gz 1000 > SRR001666_2_1000.fastq

ADD REPLY
0
Entering edit mode

In this particular case the same seed is needed to keep the pairing of reads (i.e. extract the same read pair from two files).

ADD REPLY
0
Entering edit mode

Thank you. But what does the number after s represent?

ADD REPLY
1
Entering edit mode

It's a random seed. It doesn't matter what number you use, you just have to use the same one (I usually use 1234 for things like this).

ADD REPLY
0
Entering edit mode

Got it. thank you very much.

ADD REPLY
0
Entering edit mode

Worked perfectly for me. Thank you, everyone, for responses.

ADD REPLY
4
Entering edit mode
6.7 years ago
GenoMax 141k

reformat.sh from BBMap suite.

reformat.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=Sampled_R1.fq.gz out2=Sampled_R2.fq.gz parameters_below

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
ADD COMMENT
1
Entering edit mode

Ah, I got ninja'd :)

ADD REPLY
0
Entering edit mode

So shall I use reads=100 (no of reads I want to extract?) parameter?

ADD REPLY
2
Entering edit mode

"reads=100" means it will sample from 100 reads. So if you set"reads=100 samplerate=0.5" you'll get approximately 50 reads, sampled from the first 100 reads in the file. Whereas if you set "samplereadstarget=100" you will get exactly 100 reads, sampled from the full file.

Note that if your reads are paired, these will be the number of PAIRS you get, so the number of reads would be twice that.

ADD REPLY
0
Entering edit mode

Thank you for the explanation.

ADD REPLY
1
Entering edit mode
6.7 years ago

I assume huge_file.fastq.gz is interleaved?

Using BBMap, you can do this:

reformat.sh in=huge_file.fastq.gz out=subsampled.fastq.gz samplerate=0.1 interleaved

With 2 files you can use in1= and in2= instead. There's also "samplereadstarget=X" if you want a specific number of reads (for paired data, it will give you that number of pairs).

ADD COMMENT
1
Entering edit mode
6.6 years ago

See if you'd like to try this with VSEARCH.

ADD COMMENT
0
Entering edit mode

I think VSEARCH subsampling doesn't work with fastq files, only with fasta.

ADD REPLY
0
Entering edit mode

It allows:

    Subsampling
  --fastx_subsample FILENAME  subsample sequences from given FASTA/FASTQ file

I'm using: vsearch v2.0.3

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6