How To Split Reads For Different Flowcell Lanes In Fastq Files?
10.7 years ago
newDNASeqer ▴ 760

My FASTQ file was delivered by the sequencing core as a single combined file containing reads from two flow cell lanes. Is there a way to split the reads by lane? The downstream pipeline is TopHat, Cufflinks, Cuffmerge, and Cuffdiff.

I've read the TopHat documentation and did not see an option for splitting reads, so I am asking here. Thanks.

split reads • 10k views

Is the lane in the ID of each read? If so, you could write a simple Python or Perl script to do that.

Do the IDs have any distinguishing marks? (They should.) If you post a brief snippet containing a read from each lane, one of us could probably whip up a quick script or at least help you get started.

10.7 years ago
Rm 8.3k

A quick awk solution to split a merged FASTQ file by lane (the lane is taken from the fourth colon-separated field of the read header):

paste - - - - < my.R1.fastq | awk -F"\t" '{ split($1, arr, ":"); print $1 "\n" $2 "\n+\n" $4 > ("lane." arr[4] ".R1.fastq") }'

I have a pure Awk solution that is much faster. Like the above solution, let's assume that the records are blocks of 4 lines:

awk 'BEGIN {FS = ":"} {out = "lane." $4 ".fastq"; print > out; for (i = 1; i <= 3; i++) {getline; print > out}}' < my.R1.fastq

Using the getline command 3 times, you can read blocks of 4 lines (from the standard input, hence the <).

Thanks for this solution. I tried it and it works quickly and nicely. I'm not familiar with awk, so could you please explain why your solution is faster?

Can you modify the solution so that it works if the sample was sequenced on two different flowcells but, by coincidence, both runs have the same lane number? Also, my data is gzipped FASTQ.
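One way to handle both requests, as a rough sketch rather than a tested pipeline: key the output file on the flowcell ID as well as the lane, and decompress on the fly. This assumes Illumina CASAVA 1.8+ headers (`@instrument:run:flowcell:lane:tile:x:y ...`), so the flowcell is field 3 and the lane is field 4, and strictly 4-line records. The function name and the file names are hypothetical.

```shell
# Split a gzipped FASTQ into one file per flowcell/lane combination.
# Assumption: CASAVA 1.8+ headers and exactly 4 lines per record.
split_by_flowcell_lane() {
    in_fastq_gz=$1    # e.g. my.R1.fastq.gz (hypothetical name)
    prefix=$2         # e.g. my.R1
    gzip -dc "$in_fastq_gz" | awk -F: -v prefix="$prefix" '
        # Header line: choose the output file from flowcell ($3) and lane ($4).
        NR % 4 == 1 { out = prefix "." $3 ".lane" $4 ".fastq" }
        # Every line of the record, header included, goes to that file.
        { print > out }
    '
}

# Usage (hypothetical file names):
#   split_by_flowcell_lane my.R1.fastq.gz my.R1
#   gzip my.R1.*.fastq    # recompress the outputs if needed
```

Only the header line is inspected, so a ":" inside a quality string does not affect which file a record lands in.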

This is faster if you use mawk instead of GNU awk (gawk).

On many systems, awk is already an alias for mawk:

$ /usr/bin/awk 
Usage: mawk [Options] [Program] [file ...]

Right, but not always. As I recently discovered, some of our systems have awk aliased to gawk, while some of our containers have it aliased to mawk. This resulted in very confusing performance discrepancies.
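A quick way to see what a given machine or container actually runs is to resolve the `awk` command to its real path. This is a small sketch; it assumes `readlink -f` is available (it is part of GNU coreutils) and falls back to the unresolved path otherwise. The variable names are illustrative.

```shell
# Report which awk implementation the plain `awk` command resolves to.
awk_path=$(command -v awk)
# readlink -f follows symlink chains (assumed available);
# fall back to the unresolved path if it is not.
awk_real=$(readlink -f "$awk_path" 2>/dev/null || echo "$awk_path")
echo "awk -> $awk_real"
```

Calling `mawk` or `gawk` by name in scripts avoids depending on how `awk` happens to be linked.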

Totally rad. I love one-liners.

+1 for the paste

Would this also be a correct way to handle paired-end reads (two FASTQ files rather than one)? Will the read order be the same in the resulting forward and reverse files? Or is there a safer solution for paired-end reads?

Are you referring to reads from multiple lanes in one file or just interleaved R1/R2 reads from a single lane?

It should be fine to use this solution as long as nothing else has been done to the original files. After separating the files, you can do a quick check with repair.sh from the BBMap suite to make sure the read order is retained post-split.

Yes, I have two FASTQ files: one with forward reads and one with reverse reads. Each file contains reads from all eight Illumina lanes, and I need to split them by lane so that the order of reads is the same in each resulting R1 file and its corresponding R2 file.
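A streaming split only changes which file each record is written to, so the relative order of reads within a lane is unchanged; since both mates of a pair come from the same lane, the per-lane R1 and R2 files stay in step as long as the two input files were in step to begin with. Below is a minimal sketch of splitting both files the same way and then comparing read names record by record. It assumes CASAVA 1.8+ headers (lane in field 4) and 4-line records; the function and file names are hypothetical.

```shell
# Split one FASTQ by lane: split_by_lane INPUT PREFIX
split_by_lane() {
    awk -F: -v prefix="$2" '
        NR % 4 == 1 { out = prefix ".lane" $4 ".fastq" }  # header picks the file
        { print > out }
    ' "$1"
}

# Print the read name of each record (text before the first space,
# with a trailing /1 or /2 removed for older header styles).
read_names() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, "", $1); print $1 }' "$1"
}

# Succeed only if two FASTQ files list the same read names in the same order.
pairs_in_sync() {
    read_names "$1" > "$1.names"
    read_names "$2" > "$2.names"
    cmp -s "$1.names" "$2.names"
}

# Usage (hypothetical file names):
#   split_by_lane my.R1.fastq my.R1
#   split_by_lane my.R2.fastq my.R2
#   pairs_in_sync my.R1.lane3.fastq my.R2.lane3.fastq && echo "lane 3 in sync"
```

The name comparison is only a sanity check; a dedicated tool such as repair.sh remains the more thorough option.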

10.7 years ago
Dan D 7.4k

[Image: example FASTQ record with the lane number "3" highlighted in the header line]

See that highlighted "3" in the first line? That's the lane number in Illumina's FASTQ read-header convention (the FASTQ format itself does not define a lane field). If you read in your FASTQ file and direct each read to a different output file based on that value, you'll end up with one FASTQ file per lane.

Do you need help writing the script to do that?
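Before splitting, it can help to confirm which lanes are actually present and how many reads each holds. The sketch below tallies header lines by lane; it assumes the lane is the fourth colon-separated field (Illumina CASAVA 1.8+ headers) and 4-line records, and the function name is hypothetical.

```shell
# Print "lane<TAB>read count" for each lane found in a FASTQ file.
lane_counts() {
    awk -F: '
        NR % 4 == 1 { n[$4]++ }                          # count header lines per lane
        END { for (lane in n) print lane "\t" n[lane] }
    ' "$1" | sort -n
}

# Usage (hypothetical file name):
#   lane_counts my.R1.fastq
```

If the first column shows something other than small integers, the headers probably follow a different layout and the field number needs adjusting.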
