How To Randomly Select 20M Reads From A 200M Fastq File
3
0
Entering edit mode
10.7 years ago
Angel ▴ 220

Hi All,

I have an RNASeq data from which I need to randomly select 20 million reads, around 5 times. The whole file is about 200m reads.

What is a way to do this? Does anyone have a script to share? Thanks.

fastq • 9.7k views
ADD COMMENT
2
Entering edit mode
ADD COMMENT
0
Entering edit mode
10.7 years ago
Andreas ★ 2.5k

There are plenty of ways listed in the link provided by Aaron.

Let me suggest one more solution, which works if know already how many reads you roughly have and need, is random (i.e. won't produce identical results when run multiple times) and works on (optionally gzipped) paired-end fastq files: famas -n 10

Andreas

ADD COMMENT
0
Entering edit mode
10.7 years ago
Dan D 7.4k

Here's a program I wrote back in the day called fastqReconstituter.pl Code golfers may weep when they see it, but it works. First argument is the number of reads you want to pull. Second argument is the FASTQ from which you want to pull random reads. An optional third argument specifies the destination file name. Otherwise a destination file name will be created for you. This script is not paired-end aware:

#!/usr/bin/perl
use strict;
use warnings;

#This tool selects random blocks from a FASTQ file and reassembles them into a new file 
#This allows an investigator to truncate multiple FASTQ files 
#in a way that they will be equal in size, but avoid a bias 
#in terms of flowcell geography.

#The program takes two mandatory arguments and one optional argument
#Mandatory arguments: [#] -> Number of random FASTQ blocks to select.
#                     [path] -> full path to a FASTQ file.
#One optional argulment: [destpath] -> destination path for the resultant read.
#                        By default this will just be a simple rename of the
#                        source FASTQ file.

#capture the arguments
my $numBlocks = $ARGV[0];
print "The first argument is $numBlocks\n";
my $invPath = $ARGV[1];
print "The second argument is $invPath\n";
my $destPath;
if($ARGV[2]){
    my $destPath = $ARGV[2];
    print "The third argument is $destPath\n";
}

#get the current year and month
my($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime;
my $currentMonth = $mon + 1;
$currentMonth = sprintf("%02d", $currentMonth);
my $currentYear = $year - 100;
$mday = sprintf("%02d", $mday);
$sec = sprintf("%02d", $sec);
$min = sprintf("%02d", $min);
$hour = sprintf("%02d", $hour);

#initialize validation flags
my $validSwitch = 0; #this must be true or the script will terminate before opening the FASTQ file
my $validNum = 0; #this is a switch for the number of blocks argument
my $validPath = 0;
my $validDest = 0;

#validate or create the destination path argument
if($destPath){
    if(open DESTFILE, ">$destPath"){
        $validDest = 1;
    } else {
        die "The destination file could not be created, given the path you specified!\n$!\n";
    }
} else {
    my $destFileName = "randomizedFastQ-$currentYear$currentMonth${mday}_$hour$min$sec.fastq";
    if(open DESTFILE, ">$destFileName"){
        $validDest = 1;
        print "Creating a file with the name of $destFileName.\n";
    }
    else{
        die "The destination file could not be created at the default location.\n$!\n";
    }
}

#validate the number of blocks argument
if($numBlocks =~ /^\d+$/){
    $validNum = 1;
    print "A valid argument was supplied for the number of blocks.\n";
    } else {
    print "A valid argument was NOT supplied for the number of blocks!\n";
    }

#validate the source file argument    
if(-e $invPath){
    $validPath = 1;
    print "A valid argument was supplied for the path.\n";
    } else {
    print "A valid argument was NOT supplied for the source file path!\n";
    }

if($validNum && $validPath && $validDest){
    $validSwitch = 1;
    print "Setting the valid switch to TRUE.\n";
    }

unless ($validSwitch){
    &usagexit;
}

print "Writing the FASTQ file to RAM...\n";
open INPUTFILE, $invPath or die "Can't open the input file path!\n $!\n";
my $marker = 0;
my @fileLines = <INPUTFILE>;
my $numLines = @fileLines;

print "The file has $numLines lines.\n";

unless($numLines % 4 == 0){
    die "Something's up with the FASTQ file. It doesn't have a total line count divisible by four. Aborting.\n";
} 
my $totalBlocks = $numLines / 4;
my $blockCounter = $numBlocks;

unless($totalBlocks > $numBlocks){
    die "You wanted more reads than the file contains.\n";
}

print "Generating block index...\n";
my @blockPulled;

for(my $i = 0; $i < $totalBlocks; $i++){
    $blockPulled[$i] = 0;
}

#generate a random number in a range which has its upper limit as the number of blocks
while($blockCounter > 0){
    my $startLine;
    my $randomNum = int(rand($totalBlocks));
    print "A random number is $randomNum.\n";
    if (!$blockPulled[$randomNum]){
        $blockPulled[$randomNum] = 1;
        $startLine = $randomNum * 4;
        my $marker = 0;
        my $currentLine = 0;
        while($marker < 4){
            $currentLine = $marker + $startLine;
            print $fileLines[$currentLine];
            print DESTFILE $fileLines[$currentLine];
            $marker++;
        }
    } else {
        redo;
    }
    $blockCounter--;
}

print "The total number of lines in the file is $numLines.\n";
print "Which of course means that the file contains $totalBlocks reads.\n";
print "I'm going to pull out $numBlocks of those reads at random, like you asked.\n";

close DESTFILE;
close INPUTFILE;

print "Exiting gracefully...\n";

sub usagexit{
    die "Usage: fastqRandomBlocks.pl [number of reads to harvest] [path to fastq file]\n\n";
}
ADD COMMENT

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6