Split Fastq Files Into Chunks Of 1M Reads
Bioscientist ★ 1.7k · 12.7 years ago

Each of my fastq files is about 20M reads, and I need to split these big fastq files into chunks of 1M reads.

Is there any available tool that can do such jobs?

My thought is to just count lines and write them out in batches of 4M lines (1M reads × 4 lines per record). But how can I do that with Python?

thx
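A minimal Python sketch of that line-counting idea, assuming every record is exactly 4 lines with no wrapped sequence or quality lines; the function and output names are illustrative:

```python
import gzip
from itertools import islice

def split_fastq(path, reads_per_chunk=1_000_000, prefix="chunk"):
    """Split a (possibly gzipped) FASTQ into files of reads_per_chunk reads.

    Assumes every record is exactly 4 lines (no wrapping).
    """
    lines_per_chunk = reads_per_chunk * 4
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        part = 0
        while True:
            # Grab the next chunk of lines without loading the whole file
            chunk = list(islice(fh, lines_per_chunk))
            if not chunk:
                break
            with open(f"{prefix}_{part:03d}.fastq", "w") as out:
                out.writelines(chunk)
            part += 1

# e.g. split_fastq("XXX.recal.fastq.gz", prefix="XXX")  # illustrative call
```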

edit: My input fastq file is actually in .gz compressed form.

I tried

split -l 4000000 XXX.recal.fastq.gz prefix

However, I just got one prefix-aa file which is exactly the same size as the input. I don't know if it's because of the .gz form that the lines cannot be counted?

when I tried

split -b 46m XXX.recal.fastq.gz prefix

it works well!!! The fastq.gz is successfully split into several smaller fastq.gz files.

So why can't we use the -l 4000000 option?

thx

Another question: there is only a prefix option for the split command; is there a suffix option? (I only see a suffix_length option.)

Because with a prefix the output is XXX.fastq.gz-ab, which destroys the .gz file extension.

So I want sth. like XXX_1.fastq.gz (changing suffix), how can I do that?

thx
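On the suffix question: newer GNU coreutils (split 8.16 and later) have an --additional-suffix option, and -d switches to numeric suffixes, which together give names like XXX_00.fastq. A sketch with a tiny illustrative input (note the chunks come out uncompressed):

```shell
# Tiny illustrative input: a gzipped FASTQ with 2 reads
printf '@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n' | gzip > XXX.recal.fastq.gz

# Split every 4 lines (1 read here; use -l 4000000 for 1M-read chunks),
# with numeric suffixes (-d) and a real extension (GNU coreutils 8.16+)
zcat XXX.recal.fastq.gz | split -l 4 -d --additional-suffix=.fastq - XXX_

ls XXX_*.fastq    # XXX_00.fastq  XXX_01.fastq
```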

split fastq • 34k views

split will only work on text (not gzipped files), and using -b 46m you will likely get truncated records at the end of each file. You can use:

zless recal.fastq.gz | split -l 4000000 prefix

to get a bunch of uncompressed files.


Hi brentp, I tried zless ... | split -l 4000000 prefix. The error is: cannot open `prefix' for reading: No such file or directory. It seems prefix is treated as the input here.


My bad, you need '-' to tell it to read stdin: zless ... | split -l 4000000 - prefix


Yeah, it works well, until it doesn't ;) As I wrote, you'll always get your file split, but the result is not necessarily valid fastq files. Good that it works for you; just check that every file that is generated starts with a '@'. It all depends on the sequence lines not being wrapped (not containing newlines), which they could be by the definition of the format.
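That sanity check is easy to script; a sketch, using two illustrative chunk files (one deliberately broken):

```shell
# Illustrative chunks: prefixaa is a valid record, prefixab starts mid-record
printf '@r1\nACGT\n+\nIIII\n' > prefixaa
printf 'ACGT\n+\nIIII\n'      > prefixab

# Report any chunk whose first character is not '@'
for f in prefix??; do
  [ "$(head -c 1 "$f")" = "@" ] || echo "suspect chunk: $f"
done
# prints: suspect chunk: prefixab
```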


Sorry, it doesn't work... the error is: "split: invalid option -- E"


My advice is to paste the entire command you're using into your question; then let @Jeremy update his answer, since that's the clearest solution.

Marvin ▴ 890 · 12.7 years ago

A gzipped file cannot be "split by lines" without decompressing it. Split doesn't decompress, but it can read from a pipe:

zcat XXX.recal.fastq.gz | split -l 4000000 - prefix

Naturally, the output will be uncompressed.
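If compressed chunks are wanted, GNU split (coreutils 8.13+) also has a --filter option that pipes each chunk through a command before writing it; a sketch with a tiny illustrative input:

```shell
# Tiny illustrative input: a gzipped FASTQ with 8 reads (32 lines)
for i in 1 2 3 4 5 6 7 8; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done | gzip > example.fastq.gz

# Split into 16-line (4-read) chunks; --filter recompresses each chunk,
# with $FILE expanded by split to the chunk's output name
zcat example.fastq.gz | split -l 16 --filter='gzip > $FILE.gz' - example_part_

ls example_part_*    # example_part_aa.gz  example_part_ab.gz
```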

12.7 years ago

On the command line:

split -l 4000000 myFastq.fq

behold xaa, xab, xac etc...


my split (coreutils 8.5) uses -l/--lines for number of lines per file.


Unfortunately, that will not work for fastq files with entries containing newlines, so it's definitely not safe. The fastq format definition allows that; I had the same misconception and was happy to be corrected here: http://biostar.stackexchange.com/questions/3405/processing-fastq-files-in-r-bioconductor-without-reading-entire-files-into-memory/4083#4083


This is a good solution: a fastq record is defined to be 4 lines long, and using a command-line program to do it is fast and reliable.


brentp, mine does too, not sure where -m came from


Thx guys, I got the file xaa, which seems to be a binary file. Then how can I use this file for further analysis like alignment?


Also, I need to split hundreds of fastq files, so I need to distinguish the outputs by assigning different names. (It seems splitting different fastq files here all results in the same xaa file.)


You can specify the prefix (type "man split" for the manual); presumably you would just use the name of the fastq file as the prefix.


It shouldn't be a binary file if your input was a fastq file. Did you check the output from split yet?


Hi guys, please look at my edit, thx.

opplatek ▴ 280 · 21 months ago

You can also use seqkit split2. It can split by number of sequences (-s), number of parts (-p), or length of sequences (-l). It's very fast, like most of the seqkit utils. Splitting 150M single-end FASTQ reads into 20M-read chunks took real 1m0.190s (user 3m54.514s, sys 0m2.081s).

seqkit split2 reads_1.fq.gz -s 20000000 -O out -f -e .gz # for SE reads
seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -s 20000000 -O out -f -e .gz # for PE reads
Ian 6.0k · 12.7 years ago

If you want an already written script:

SHRIMP has a script called 'splitreads.py'. [part of installation]

PerM has a script called 'splitReads.sh'.

Take your pick :-)

Andreas ★ 2.5k · 8.5 years ago

For the sake of completeness: the latest version of famas (commit 04ed15e) can do this as well. The split is based on number of reads though and not file size.

Andreas

Michael 54k · 12.7 years ago

I don't know Python, but some Python pros told me that it can do string handling and regular expressions (almost ;) ) as powerfully as Perl, so here is what I would do in Perl; I guess it would be easy to translate if you must.

If you have enough memory to read the whole file into one string, do something like:

my @records = split /\n@/, $filecontent;

That way you get an array of fastq records. Or, if the file doesn't fit in memory, change the input record separator (in Perl that's $/, default "\n") to "\n@", then read the file record by record:

local $/ = "\n@";                         # input record separator
open(FASTQ, '<', 'reads.fastq') or die $!; # illustrative filename
while (<FASTQ>) {
  # do stuff with one record
}
close(FASTQ);

Or, in case you are 100% sure, there are no wrapped lines, use Jeremy's method.

Edit: Even this is not totally safe: '@' can appear in the quality string, so in theory a "\n@" is possible inside the quality string. Then you have to keep track of the opening markers "\n+" and closing markers "\n@", making the coding just a little more cumbersome.

So, please, do not use this ill defined FASTQ format! ... What, there is nothing else? You must be kidding me!


Has anyone ever seen a FASTQ file where there was an errant newline? (from a file created by software used by more than 2 people)


No, I haven't, but when I proposed split -l, people objected vigorously, so I assumed this was to be taken seriously. Btw.: "The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string)." (Wikipedia). So, use split -l 4X, but please, please, take very good care...


hi michael, I checked some files, and it does start with @. But you are right, definitely should be cautious with using split this way.


I should have ignored the comments I got back then; from now on I will again assume that every record has 4 lines and that split is safe :P


Oh yes, that's actually the best answer ever: DON'T USE FASTQ! You can always replace FastQ by BAM, and you get a reasonably well-defined, compressed format with good supporting tools.
