How to subset reads from a FASTQ file to use as a test file for a script
5.2 years ago
msimmer92 ▴ 300

I know this question already exists on Biostars (see https://www.biostars.org/p/6544/#361743), but its answers are very old (most are from 7 years ago), so I wanted to ask for an update.

Context: I wrote a script that checks the extension of my input file: 1) if it is FASTQ, it maps the reads with Bowtie2; 2) if it is BAM, it does something else. I want to test the script with files small enough that it runs as fast as possible while I debug it, but I only have my original samples (large human ChIP-seq data). (Bash script, Linux, Bash 4.)

The question: how can I take a subset of reads from the FASTQ file and make a test file from it? It has to be small enough for the program to run fast, but realistic enough to map without causing errors that would distract me from debugging my script. What is the current best practice for this? Thank you!

bash fastq subset sample • 8.5k views

Given a compressed FASTQ file, which you probably already have from your data source:

zcat file.fastq.gz | head -n N    # N must be a multiple of 4, since each FASTQ record spans 4 lines

By the way, it is good practice to link to any threads you mention in a question. I am sure they contain helpful material.
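
The head approach above can be sketched end to end as follows. This is a minimal example under stated assumptions: the file names (sample.fastq.gz, test.fastq) and the read count are made up for illustration, and the script first builds a tiny synthetic input so it can run on its own. It uses `gzip -dc`, which does the same job as `zcat` for .gz files.

```shell
#!/usr/bin/env bash
# pipefail is deliberately not set: head closes the pipe early, so on a
# large file the decompressor would exit with SIGPIPE and fail the pipeline.
set -eu

workdir=$(mktemp -d)
in_fq="$workdir/sample.fastq.gz"   # hypothetical input name
out_fq="$workdir/test.fastq"       # hypothetical output name

# Build a tiny 5-read gzipped FASTQ so the sketch is self-contained.
for i in 1 2 3 4 5; do
  printf '@read%s\nACGTACGTAC\n+\nIIIIFFFF##\n' "$i"
done | gzip -c > "$in_fq"

n_reads=3
# Each record is 4 lines, so the first n reads are the first 4*n lines.
gzip -dc "$in_fq" | head -n "$((n_reads * 4))" > "$out_fq"

echo "reads written: $(( $(wc -l < "$out_fq") / 4 ))"
```

Keep in mind that this takes the first reads in the file rather than a random sample, which is usually fine for debugging a script but may not be representative of the whole run.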


fastq-sample from fastq-tools

5.2 years ago
Malcolm.Cook ★ 1.5k

I believe the answer remains seqtk; see A: Selecting Random Pairs From Fastq?
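
As a rough illustration of the seqtk route, the sketch below uses `seqtk sample` on a paired-end toy data set. The file names, seed, and read count are assumptions made for the example; the key point is that using the same `-s` seed for both mates keeps the pairs in sync. The script skips the sampling step if seqtk is not installed.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
r1="$workdir/reads_R1.fastq"   # hypothetical mate 1 file
r2="$workdir/reads_R2.fastq"   # hypothetical mate 2 file

# Tiny synthetic paired-end input (10 pairs) so the sketch runs on its own.
for i in $(seq 1 10); do
  printf '@read%s/1\nACGTACGTAC\n+\nIIIIFFFF##\n' "$i" >> "$r1"
  printf '@read%s/2\nGTACGTACGT\n+\nIIIIFFFF##\n' "$i" >> "$r2"
done

seed=100   # any fixed seed; it must be identical for both mates
n=4        # a value above 1 is a read count; a value below 1 is a fraction

if command -v seqtk >/dev/null 2>&1; then
  seqtk sample -s"$seed" "$r1" "$n" > "$workdir/sub_R1.fastq"
  seqtk sample -s"$seed" "$r2" "$n" > "$workdir/sub_R2.fastq"
  echo "sampled $(( $(wc -l < "$workdir/sub_R1.fastq") / 4 )) pairs"
else
  echo "seqtk not found on PATH; install it to run the sampling step" >&2
fi
```

For single-end data, only the first `seqtk sample` call is needed.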


Even though the other comments and answers are ok as well, I selected this one because it was simple, fast and effective. Thanks!

5.2 years ago
GenoMax 141k

Use reformat.sh from the BBMap suite.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
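
A possible invocation using the parameters listed above might look like the following sketch. The file names, the target of 4 reads, and the seed are assumptions chosen for the example; `in=` and `out=` are the usual BBMap-style input and output arguments. The script builds a toy input and skips the sampling step if reformat.sh is not on the PATH.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
in_fq="$workdir/reads.fastq"   # hypothetical input name
out_fq="$workdir/test.fastq"   # hypothetical output name

# Tiny synthetic input (10 reads) so the sketch runs on its own.
for i in $(seq 1 10); do
  printf '@read%s\nACGTACGTAC\n+\nIIIIFFFF##\n' "$i"
done > "$in_fq"

if command -v reformat.sh >/dev/null 2>&1; then
  # samplereadstarget: exact number of output reads wanted
  # sampleseed: positive seed, so the same subset is drawn on every run
  reformat.sh in="$in_fq" out="$out_fq" samplereadstarget=4 sampleseed=42
  echo "reads written: $(( $(wc -l < "$out_fq") / 4 ))"
else
  echo "reformat.sh not found on PATH; install BBMap to run the sampling step" >&2
fi
```

To take a fraction of the reads instead of a fixed number, samplerate (for example samplerate=0.01) can be used in place of samplereadstarget.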
