Question

How To Split Reads For Different Flowcell Lanes In Fastq Files?

1

10.7 years ago

newDNASeqer ▴ 760

My FASTQ file was delivered by the sequencing core as a single combined file containing reads from two flow cell lanes. Is there a way to split the reads by lane? The downstream pipeline is TopHat-Cufflinks-Cuffmerge-Cuffdiff.

I've also read the TopHat documentation and did not see an option for splitting the reads there, so I am asking here. Thanks.

split reads • 10k views

ADD COMMENT • link updated 5 months ago by steve ★ 3.5k • written 10.7 years ago by newDNASeqer ▴ 760

0

Is the lane in the ID of each read? If so, you could write a simple Python/Perl script to do that.

ADD REPLY • link 10.7 years ago by Gabriel R. ★ 2.9k

0

Do the IDs have any distinguishing marks? (They should.) If you post a brief snippet containing a read from each lane, one of us could probably whip up a quick script or at least help you get started.

ADD REPLY • link 10.7 years ago by Alex Reynolds 35k

score 8 · Answer 1 · 2013-08-02

8

10.7 years ago

Rm 8.3k

A quick Awk solution to separate a merged FASTQ file by lane:

paste - - - - < my.R1.fastq | awk -F"\t" '{ split($1, arr, ":"); print $1 "\n" $2 "\n+\n" $4 > ("lane." arr[4] ".R1.fastq") }'

ADD COMMENT • link 10.7 years ago by Rm 8.3k

4

Here is a pure Awk solution that is much faster. Like the solution above, it assumes that each record is a block of 4 lines:

awk 'BEGIN {FS = ":"} {lane=$4 ; print > "lane."lane".fastq" ; for (i = 1; i <= 3; i++) {getline ; print > "lane."lane".fastq"}}' < my.R1.fastq

Each call to getline reads the next line of input, so calling it 3 times after the header line consumes the rest of the 4-line record (here the input comes from standard input, hence the <).

ADD REPLY • link 10.7 years ago by Frédéric Mahé ★ 3.2k
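
The same four-line-record idea can also be written without getline. The following is only a minimal sketch, not either poster's method: the header line (every fourth line, starting with the first) selects the output file, and every line is written to the currently selected file. The function name split_by_lane and the output prefix argument are hypothetical. The sketch assumes unwrapped four-line records and Illumina 1.8+ style headers, where the lane is the fourth colon-separated field.

```shell
# Hypothetical helper: split a 4-line-per-record FASTQ file by lane.
# Assumes Illumina 1.8+ headers (@instrument:run:flowcell:lane:tile:x:y ...),
# so the lane is colon-separated field 4 of each header line.
# Usage: split_by_lane my.R1.fastq R1   ->   R1.lane.<N>.fastq
split_by_lane() {
    awk -F: -v prefix="$2" '
        NR % 4 == 1 { out = prefix ".lane." $4 ".fastq" }  # header line picks the output file
        { print > out }                                     # write every line of the record there
    ' "$1"
}
```

Keeping the output name in a variable avoids the unparenthesized string concatenation after `>`, which POSIX leaves unspecified and which some awk implementations may parse differently.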

1

Thanks for this solution. I tried it and it works quickly and nicely. I'm not familiar with awk, so could you please explain why your solution is faster?

ADD REPLY • link 7.9 years ago by DVA ▴ 630

0

Can you modify the solution so that it also works when the sample was sequenced on two different flowcells and, by coincidence, both runs used the same lane number? Also, my data is gzipped FASTQ.

ADD REPLY • link 2.9 years ago by dario.garvan ▴ 530
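
One possible way to handle this case, sketched under assumptions: if the headers follow the Illumina 1.8+ layout, the flowcell ID is the third colon-separated field and the lane is the fourth, so both can go into the output file name. The input is decompressed with gzip -dc and each output is piped back through gzip. The function name split_by_flowcell_lane and the output naming scheme are hypothetical, and the prefix is assumed not to contain double quotes or other shell metacharacters.

```shell
# Hypothetical helper: split a gzipped FASTQ by flowcell ID and lane.
# Assumes Illumina 1.8+ headers: field 3 = flowcell ID, field 4 = lane.
# Usage: split_by_flowcell_lane my.R1.fastq.gz R1
#        ->  R1.<flowcell>.lane.<N>.fastq.gz
split_by_flowcell_lane() {
    gzip -dc "$1" | awk -F: -v prefix="$2" '
        # Header line: build the gzip command that receives this record
        NR % 4 == 1 { out = "gzip > \"" prefix "." $3 ".lane." $4 ".fastq.gz\"" }
        # awk keeps one pipe open per distinct command string
        { print | out }
    '
}
```

awk closes its open pipes when it exits, so the compressed files are complete once the function returns.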

0

This is faster if you use mawk instead of GNU awk.

ADD REPLY • link 5 months ago by steve ★ 3.5k

0

awk is often an alias for mawk.

$ /usr/bin/awk 
Usage: mawk [Options] [Program] [file ...]

ADD REPLY • link 5 months ago by Pierre Lindenbaum 161k

0

Right, but not always. As I recently discovered, some of our systems have awk aliased to gawk, while some of our containers have it aliased to mawk. This resulted in very confusing performance discrepancies.

ADD REPLY • link 5 months ago by steve ★ 3.5k
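
A quick way to see which implementation a given system or container actually runs is sketched below. The function name which_awk is hypothetical. The sketch assumes a readlink that supports -f (as in GNU coreutils), and relies on gawk and mawk both accepting -W version; other awk implementations may print an error message instead of a version string.

```shell
# Hypothetical helper: report which awk implementation `awk` resolves to.
which_awk() {
    awk_path=$(command -v awk) || return 1
    # Resolve symlinks, e.g. /usr/bin/awk -> /etc/alternatives/awk -> /usr/bin/mawk
    readlink -f "$awk_path"
    # First line of the version banner; stdin is /dev/null so an awk that
    # does not understand -W cannot block waiting for input
    awk -W version 2>&1 </dev/null | head -n 1
}
```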

0

Totally rad. I love one-liners.

ADD REPLY • link 10.7 years ago by Dan D 7.4k

0

+1 for the paste

ADD REPLY • link 10.7 years ago by Pierre Lindenbaum 161k

0

I'm wondering whether this is a correct way to work with paired-end reads (not just a single FASTQ file). Will the read order be the same in the resulting files containing forward and reverse reads? Or is there a safer solution for paired-end reads?

ADD REPLY • link 5.1 years ago by Denis ▴ 290

1

Are you referring to reads from multiple lanes in one file or just interleaved R1/R2 reads from a single lane?

It should be fine to use this solution as long as nothing else has been done to the original files. You can do a quick check with repair.sh from the BBMap suite after separating the files to make sure the read order is retained post-split.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Yes, I have two FASTQ files, one with forward and one with reverse reads. Each file contains reads from all 8 Illumina lanes, and I need to split them by lane so that the order of reads is the same in every resulting R1 file and its corresponding R2 file.

ADD REPLY • link 5.1 years ago by Denis ▴ 290
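
For this paired-end case, a minimal sketch is to run the same lane split on R1 and R2 separately and then compare read names. awk writes records in the order it reads them, so if the two input files list the same reads in the same order, each per-lane R1/R2 pair should too. The function names split_pair_by_lane and same_read_order are hypothetical. The sketch assumes Illumina 1.8+ headers, where the read name is the text before the first space and is identical for both mates, and it assumes mktemp is available.

```shell
# Hypothetical helper: split both mates by lane.
# $1 = R1 FASTQ, $2 = R2 FASTQ  ->  lane.<N>.R1.fastq and lane.<N>.R2.fastq
split_pair_by_lane() {
    n=1
    for fq in "$1" "$2"; do
        awk -F: -v mate="R$n" '
            NR % 4 == 1 { out = "lane." $4 "." mate ".fastq" }  # lane = header field 4
            { print > out }
        ' "$fq"
        n=$((n + 1))
    done
}

# Hypothetical helper: succeed only if both files list the same read names
# (header text before the first space) in the same order.
same_read_order() {
    names1=$(mktemp) && names2=$(mktemp) || return 2
    awk 'NR % 4 == 1 { print $1 }' "$1" > "$names1"
    awk 'NR % 4 == 1 { print $1 }' "$2" > "$names2"
    cmp -s "$names1" "$names2"
    status=$?
    rm -f "$names1" "$names2"
    return "$status"
}
```

For example, after `split_pair_by_lane my.R1.fastq my.R2.fastq`, running `same_read_order lane.1.R1.fastq lane.1.R2.fastq && echo "lane 1 OK"` gives a quick sanity check for one lane. With older header styles that end in /1 and /2, the names would differ between mates and this comparison would need adjusting.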

score 4 · Answer 2 · 2013-08-02

4

10.7 years ago

Dan D 7.4k

[Image: a FASTQ record with the "3" in its first line highlighted]

See that highlighted "3" in the first line? That's the lane number in the Illumina FASTQ header. If you read in your FASTQ file and direct each read to a different output file based on that value, you'll end up with separate FASTQ files for each lane.

Do you need help writing the script to do that?

ADD COMMENT • link 10.7 years ago by Dan D 7.4k
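
Before splitting, it can help to check which lane values a file actually contains. The following is a rough sketch with a hypothetical function name, count_reads_per_lane. It assumes Illumina 1.8+ headers, where the lane is the fourth colon-separated field; in the older header style (instrument:lane:tile:x:y) the lane would be the second field, so $4 would need to become $2.

```shell
# Hypothetical helper: list the lanes present in a FASTQ file with read counts.
# Assumes 4-line records and the lane in colon-separated field 4 of the header.
count_reads_per_lane() {
    awk -F: '
        NR % 4 == 1 { n[$4]++ }                          # count one read per header line
        END { for (lane in n) print lane "\t" n[lane] }  # lane <TAB> number of reads
    ' "$1" | sort -n
}
```

If the output shows unexpected values, such as tile numbers instead of lanes, the header layout is probably not the one assumed here.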