Biostar Beta. Not for public use.
Question: Fastq Splitter For Paired End Reads
10
Entering edit mode

Hi!!

I have to analyze paired-end RNA-seq read that are in an unusual format: both pair-end reads are joined in one FASTQ. I need to split the file in two separated FASTQ paried-end files.

There are a galaxy tool named FASTQ splitter that can do this:


FASTQ splitter

What it does

Splits a single fastq dataset representing paired-end run into two datasets (one for each end). This tool works only for datasets where both ends have the same length.

Sequence identifiers will have /1 or /2 appended for the split left-hand and right-hand reads, respectively.

Input format

A multiple-fastq file, for example:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATCGCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh`hfhhVZSWehR

Outputs

Left-hand Read:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATC
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Right-hand Read:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
GCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
hhhhhhhhhhhhhhhhhhhhhhhh`hfhhVZSWehR

Do you know any other standard alone script that can do this job?

ADD COMMENTlink 7.9 years ago Geparada ♦ 1.4k • updated 5.3 years ago smilefreak • 420
Entering edit mode
6

Just as a side note, this output is what you get when you use 'fastq-dump' of the SRA tools to download paired-end data from the SRA archive. In that case, there is a nice option: 'fastq-dump --split-files' which will output one file for all the /1 pairs and another file for the /2 pairs.

ADD REPLYlink 5.9 years ago
Leonor Palmeira
3.7k
Entering edit mode
0

I tried this command on interleaved fq containing paired reads from both ends, but it did not work even if I changed .fq to .sra. Should it have worked?

ADD REPLYlink 5.5 years ago
trakhtenberg
• 150
Entering edit mode
1

Hi trakhtenberg,

Hopefully you were able to figure this out given how long ago this was, but for posterity of others looking for answers, the fastq-dump does not act on local files. It instead downloads files from SRA and converts them into fastq files for you. If the SRA accession number (usually SRR#######) stores paired-end reads, you should use the following command:

fastq-dump --split-files -O path/to/output SRR#######

More information about fastq-dump and other SRA toolkit utilities can be found here: http://www.ncbi.nlm.nih.gov/books/NBK158900/

(edit to make code clearer)

ADD REPLYlink 4.0 years ago
warren-mcgee
• 30
11
Entering edit mode

copy this code to a file, e.g. splitPairedEndReads.pl

use strict;
use warnings;
my $file = $ARGV[0];
open(FILE, "<$file") || die "cannot open $file\n";
open(OUT1, ">$file\_1") || die "cannot open $file\_1\n";
open(OUT2, ">$file\_2") || die "cannot open $file\_2\n";
while(<FILE>){
    chomp;
    print OUT1 "$_\/1\n";
    print OUT2 "$_\/2\n";
    my $newline = <FILE>; chomp($newline);
    print OUT1 substr($newline, 0, length($newline)/2)."\n";
    print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    $newline = <FILE>; chomp($newline);
    print OUT1 "$newline\/1\n";
    print OUT2 "$newline\/2\n";
    $newline = <FILE>; chomp($newline);
    print OUT1 substr($newline, 0, length($newline)/2)."\n";
    print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
}
close(FILE);

Now call it with ./splitPairedEndReads.pl input.fastq. I hope it works.... haven't checked it.

ADD COMMENTlink 7.9 years ago David Langenberger 8.9k
Entering edit mode
0

@david : nice simple script: its splits to half: what if datasets where both ends have the not same length? thats what Geparada might be looking for?

ADD REPLYlink 7.9 years ago
Rm
7.8k
Entering edit mode
0

Well, if they don't have the same length, then there has to be an information about the position to split. In his example, I cannot find anything like that. Furthermore, this format (split in the middle) is well known.

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
Entering edit mode
0

But it's simple to change the script and split on different positions.

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
Entering edit mode
0

I agree with that!!

ADD REPLYlink 7.9 years ago
Rm
7.8k
Entering edit mode
0

@ David: Thanks a lot for the Script. It is exactly wha was looking for. However, unfortunately it does not really do what it is supposed to. Here is the head of the original file:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATCAATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT +HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJIIJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD @HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTGTGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC +HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGHIHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE @HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGATTTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG

and here after splitting:

File1: @HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1 GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATC +HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1 DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJI @HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1 GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTG +HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1 FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGH @HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/1 CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGAT

File2: @HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2 AATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT +HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2 IJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD @HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2 TGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC +HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2 IHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE @HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/2 TTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG

I guess that the script did not realize the paired information that is given with the number (1 or 2) after the first space (CASAVA 1.8). I know that this is exactly the problem with other scipts that do not work anymore after the Illumina CASAVA 1.8 Update. What would I have to change to split the above shown format? Thanks a lot for any help on this issue!

ADD REPLYlink 7.9 years ago
Nico Posnien
• 0
Entering edit mode
0

At a first glance, it seems there is a problem with your format. Directly after the quality values in the third line, there seems to be no newline. Is that a problem of the editor here, or is it in you file? There have to be 4 lines for each entry!

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
Entering edit mode
0

I tried the head of your file (after manually inserting the newlines) and the output seems to be correct.

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
8
Entering edit mode

For the left read:

awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }'

for the right read:

awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2) }'
ADD COMMENTlink 7.9 years ago Marvin • 850
Entering edit mode
4

Very elegant, but string indices in awk are 1-based so that as is, above code duplicates the last base of the first read as the first base of the second read. To avoid that, instead run:

awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2+1) }'

for 2nd/right read. (Code for 1st/left read can remain unchanged since awk substr returns the same result for both 0 and 1 as the start index.)

ADD REPLYlink 3.8 years ago
rschulz
• 40
Entering edit mode
0

Thank you for this indexing fix, @rschulz, and for the general method, @Marvin!

I used this method to separate some PE reads downloaded through NCBI's sratools fastq-dump. This works just great with the fixed indexing at the end of the "right read" function.

ADD REPLYlink 3.3 years ago
paanvaannd
• 0
Entering edit mode
0

much more simpler and efficient than perl or python

ADD REPLYlink 3.9 years ago
tiago211287
♦ 1.1k
2
Entering edit mode

Thanks for your script David Langenberger!

I don't understand perl scripts, so I decided to make my own python script, but again thanks for your time!

import sys
from Bio import SeqIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqRecord import SeqRecord

def main(fastq):

    Rd1_out_name = fastq.replace(".fastq", "Rd1.fastq")
    Rd2_out_name = fastq.replace(".fastq", "Rd2.fastq")
    Rd1_out = open(Rd1_out_name, 'w')
    Rd2_out = open(Rd2_out_name, 'w')

    for record in SeqIO.parse(fastq, "fastq"):

        seq = str(record.seq)
        Rd1_seq = seq[:len(seq)/2]
        Rd2_seq = seq[len(seq)/2:]                                  
        Q = record.letter_annotations["phred_quality"]
        Rd1_Q = Q[:len(Q)/2]
        Rd2_Q = Q[len(Q)/2:]            
        Rd1_id = record.id.strip("/1").strip("/2") + "/1"
        Rd2_id = record.id.strip("/1").strip("/2") + "/2"

        Rd1 = SeqRecord( Rd1_seq , id = Rd1_id, description = "" )
        Rd1.letter_annotations["phred_quality"] = Rd1_Q         
        Rd2 = SeqRecord( Rd2_seq , id = Rd2_id, description = "" )
        Rd2.letter_annotations["phred_quality"] = Rd2_Q

        Rd1_out.write(Rd1.format("fastq"),)
        Rd2_out.write(Rd2.format("fastq"),)


if __name__ == '__main__':
    main(sys.argv[1])
ADD COMMENTlink 7.9 years ago Geparada ♦ 1.4k
Entering edit mode
0

Great idea! I think in Galaxy they ask you to cite them, when using the tool, don't they? Haha! I think you can publish everything....

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
Entering edit mode
0

I just try to don't use galaxy because I run my pipelines by shell scripts ...

ADD REPLYlink 7.9 years ago
Geparada
♦ 1.4k
Entering edit mode
0

I just use it sometimes to test some tools... when I'm too lazy to install it. But for high-throughput analysis it's completely useless.

ADD REPLYlink 7.9 years ago
David Langenberger
8.9k
1
Entering edit mode

Using List Comprehension in python

fastq_1 = open('fastq_r1.fastq')

fastq_2 = open('fastq_r2.fastq')

[r1.write(line) if (i % 8 < 4) else r2.write(line) for i, line in enumerate(open('test.fastq'))]

fastq_1.close()

fastq_2.close()
ADD COMMENTlink 5.3 years ago smilefreak • 420
Entering edit mode
1

Elegant and simple, but you necro'd a 2.6-year-old old topic :)

Edit: might as well contribute anyway. I'd recommend wrapping using 'with' to make it even more idiomatic:

fastq_1 = open('fastq_r1.fastq')

becomes

with open('fastq_r1.fastq', 'a') as fastq_1:

etc., in case you forget to close your file handles.

ADD REPLYlink 5.3 years ago
Brice Sarver
♦ 2.6k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0