Biostar Beta. Not for public use.
Question: cat files with GNU Parallel

I have several fastq files with the following naming convention:

SampleID_SampleNo_LaneNo_R1_001.fastq.gz. SampleNo ranges from S0 to S87 and LaneNo from L001 to L004; an example name is 1025-00_S15_L002_R1_001.fastq.gz. I want to use GNU parallel to put all *R1_001.fastq.gz files per sample into one file, irrespective of LaneNo. Something like:

parallel --dryrun -v --progress "cat  {} > SampleID_SampleNo_R1_001.fastq.gz" :::  *L00{1..4}_R1_001.fastq.gz
— tayebwajb • 90, 5.5 years ago • updated 5.5 years ago by ole.tange ♦ 3.4k

FYI: If you want to map your fastq with bwa or whatever, believe me, you don't want to concatenate those files.

— Pierre Lindenbaum 120k, 5.5 years ago

How about I map each lane's fastq separately, so that I get a BAM per lane, and then merge those BAMs into a per-sample BAM? Then there is no need to concatenate.
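That per-lane-then-merge workflow can be sketched as a dry run. The tool invocations (bwa mem, samtools sort/merge), the reference path ref.fa, and the read-group layout are assumptions; the commands are only written to a plan file here, not executed:

```shell
# Dry-run sketch: write the per-lane mapping commands and the final merge
# command to a plan file. Sample name, reference (ref.fa), and read groups
# are hypothetical; inspect mapping_plan.txt before running anything.
sample=1025-00_S15
ref=ref.fa
: > mapping_plan.txt
for lane in L001 L002 L003 L004; do
  printf '%s\n' "bwa mem -R '@RG\tID:${sample}.${lane}\tSM:${sample}' ${ref} ${sample}_${lane}_R1_001.fastq.gz | samtools sort -o ${sample}.${lane}.bam -" >> mapping_plan.txt
done
printf '%s\n' "samtools merge ${sample}.bam ${sample}.L001.bam ${sample}.L002.bam ${sample}.L003.bam ${sample}.L004.bam" >> mapping_plan.txt
```

Tagging each lane with its own read group (ID per lane, SM per sample) lets downstream tools distinguish lanes inside the merged BAM.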

— tayebwajb • 90, 5.5 years ago

This is usually what I do, and you can go one step further: map all the file parts separately, then merge the alignments by lane and then by sample.

— Matt Shirley 9.0k, 5.5 years ago

Yes, the 001 part is the first portion on the edge of the flow-cell and usually has the worst quality.

— Matt Shirley 9.0k, 5.5 years ago

This is not the time to use parallelization. You want the file contents to go together in a defined, consistent order; parallelizing could put separate chunks in front of or behind each other, and asynchronous processing makes resource allocation nondeterministic.

Also: you won't get much speedup, because 'cat' doesn't use much CPU. You're already bound by the speed of the disk drives, so accessing them multiple times at once is a detriment.

You could fix both issues with a binary-tree scheme: concatenate pairs of files, then pairs of the results, working up to the complete file. It's more effort, but worthwhile if you do this task repeatedly. Otherwise, let the operating system and RAID drivers handle that kind of parallelization; as far as the userland application is concerned, a single "cat" will dump the data as fast as possible.
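The binary-tree idea looks like this for four files. This is a toy sketch with made-up file names and contents, just to show the shape (gzip members concatenate cleanly, so the same structure works on .fastq.gz):

```shell
# Toy binary-tree merge: merge pairs concurrently, then merge the pair
# results. Names and contents are invented for illustration only.
printf 'a\n' > part1; printf 'b\n' > part2
printf 'c\n' > part3; printf 'd\n' > part4
cat part1 part2 > merge12 &   # the two pair-merges can run in parallel...
cat part3 part4 > merge34 &
wait                          # ...but the final merge must wait for both
cat merge12 merge34 > whole   # top level of the tree, order preserved
```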

— karl.stamm 3.5k, 5.5 years ago

I don't believe you need or want to use Parallel for this, as you don't need the parallel execution. Try ls/egrep/xargs/cat:

ls *.fastq.gz | egrep '1025-00_S15_L00[1-4]_R1_001' | xargs cat >> 1025-00_S15_R1_001.fastq.gz
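To cover every sample in one pass, the same approach can be looped over the unique prefixes. A sketch (the demo input names are made up to follow the question's convention; concatenating the .gz files directly is valid, since a gzip file may consist of several concatenated members):

```shell
# Create demo lane files following the question's naming convention
# (hypothetical sample; real data would already be on disk)
for lane in L001 L002 L003 L004; do
  printf 'read-from-%s\n' "$lane" | gzip > "1025-00_S15_${lane}_R1_001.fastq.gz"
done

# For each sample prefix (derived from its L001 file), cat the lanes in order
for f in *_L001_R1_001.fastq.gz; do
  prefix=${f%_L001_R1_001.fastq.gz}
  cat "${prefix}"_L00[1-4]_R1_001.fastq.gz > "${prefix}_R1_001.fastq.gz"
done
```

Using `>` rather than `>>` here means a rerun overwrites the merged file instead of appending a second copy of the reads.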

— Matt Shirley 9.0k, 5.5 years ago

If you have GNU Parallel 20140822 or later, this should work:

parallel eval 'cat {=s/L001_/L*_/=} > {=s/L001_//=}' ::: *_L001_R1_001.fastq.gz
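For a single input file, the effect of the two `{= … =}` Perl-substitution replacement strings can be illustrated with sed. This is a sketch of what each replacement produces, not GNU Parallel's internal mechanism:

```shell
f="1025-00_S15_L001_R1_001.fastq.gz"
src_glob=$(printf '%s' "$f" | sed 's/L001_/L*_/')   # {=s/L001_/L*_/=} -> glob matching all lanes
dest=$(printf '%s' "$f" | sed 's/L001_//')          # {=s/L001_//=}    -> lane-free output name
printf 'cat %s > %s\n' "$src_glob" "$dest"          # the command that eval would run
```

The `eval` is what lets the shell expand the `L*_` glob at run time, so one job per L001 file concatenates all four lanes of that sample.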

karl.stamm mentions that you may be limited by disk speed. While that used to be absolutely true (and still is if you only have a single harddisk), it is no longer always true for RAID devices, network drives, and SSDs. On one RAID device I experienced a 600% speedup running 10 jobs in parallel, and a smaller speedup running either more or fewer jobs. In other words, the only way you can know is by trying and measuring.

— ole.tange ♦ 3.4k, 5.5 years ago

Linux does a great job of caching disk access with RAM, so it could be fooling you and finishing up the process in the background. To get a 600% speedup, you'll have to be writing to six drives at once, which would imply more than seven drives in the RAID array. If your network storage is really huge, this is quite possible. My servers will only write with two drives. If your SAN is in the petabyte range, of course it will have more R/W heads available, and probably another layer of RAM cache.

— karl.stamm 3.5k, 5.5 years ago

The RAID was a 40 disk RAID60, so it is not unreasonable to get a 600% speedup. The size of the data was so much bigger than RAM that caching in practice would have no effect. It still does not change the conclusion: What used to be always true about parallel disk I/O, sometimes no longer is, so measure instead of assume.

— ole.tange ♦ 3.4k, 5.5 years ago
