Biostar Beta. Not for public use.
Question: Combining The Paired Reads From Illumina Run
9
Entering edit mode

Hi,

I have two fastq files with the forward(/1) and reverse(/2) paired reads. The reads are not in same order in either file, some pairs are absent/missing and the files are 8 GB each with abt 30 mill reads each.

I am trying to pull out all the paired reads for which both fwd and rev exist and running out of memory using a Bioperl script. Are there any C or C++ based efficient tools out there for doing this? Any algorithm ideas where I don't need to read in both read files into memory and can just parse through them.

Thanks!

-S.

ADD COMMENTlink 8.9 years ago Surya Saha • 260 • updated 6.7 years ago Eric Normandeau 10k
15
Entering edit mode

(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up-to-date)

This script that I wrote combines two fastq files that have been trimmed and contain orphans. It should be using MUCH less memory than the AWK script for big files:

https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py

It reads one sequence from file1, searches in hash2 is the corresponding sequence has been read yet. If yes, it writes both sequences to files, if not, it adds the sequence to hash1. Then, a sequence is read from file2, with the same pattern until the end of both files.

Here would be an example call to it:

python fastqCombinePairedEnd.py  fastq1  fastq2

NOTE: If it can be assumed that sequences in both files are ordered, the script could be made much more memory efficient at the expense of a bit more computation. For example, if we read sequence X in fastq1 and that it is also in fastq2, it could be assumed that all sequences that have been added to hash1 before can be flushed. This would lead to a very low memory footprint.

ADD COMMENTlink 6.7 years ago Eric Normandeau 10k
Entering edit mode
0

Thanks for great script! Very fast implementation. However, I would like to add that for me it works only in python 2.7.3 environment.

ADD REPLYlink 6.4 years ago
Biomonika (Noolean)
3.1k
Entering edit mode
0

My pleasure :) Do you mean it does not work with Python 3 (which it is not supposed to) or that it does not work in older environments (like Python 2.6)?

ADD REPLYlink 6.4 years ago
Eric Normandeau
10k
Entering edit mode
0

Works a treat,

Thanks Eric!

ADD REPLYlink 4.9 years ago
Jonathan Crowther
• 180
Entering edit mode
1

My pleasure to see that this script is still being used by new people.

ADD REPLYlink 4.7 years ago
Eric Normandeau
10k
Entering edit mode
0

Thanks for the script

ADD REPLYlink 4.4 years ago
dmenning
• 0
Entering edit mode
0

This script is executing fine on my laptop...gives perfect results when I execute it on my laptop. But when I try to execute it on the server, it generates blank files. I don't know why. The operating system of my server is Debian. Do have any Idea like how can I fix this??

ADD REPLYlink 3.4 years ago
swadha
• 20
Entering edit mode
1

The script was written for Python 2.7. There is a possibility it will not work if you are using Python 3.x the server. You can test this with which python.

ADD REPLYlink 3.3 years ago
Eric Normandeau
10k
Entering edit mode
0

It doesn't work for me.

$ python --version

Python 2.7.13

$ which python

/Library/Frameworks/Python.framework/Versions/2.7/bin/python

I run "python fastqCombinePairedEnd.py fastq1 fastq2". My fastq1 is 4.6 MB and fastq2 is 4.5MB. The command run quickly. Two output files (_pairs_R1.fastq and _pairs_R2.fastq) do not have sequences (0 bytes). The xx_singles.fastq is 9.2 MB.

For more information: The header of 1st sequence in fastq1:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=80.0; status=partial; direction=reverse; tail_quality=39.0; reverse_match=ccacctatcacataatcatg; seq_length_ori=151; seq_length=121; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.5267175573; head_quality=33.3; avg_quality=38.2119205298; grab=BI_0_1; reverse_primer=ccacctatcacayaatcatg; 1:N:0:1

The header of 1st sequence in fastq2:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=76.0; status=partial; direction=reverse; tail_quality=38.5; reverse_match=aagggcaccacaagaacgc; seq_length_ori=151; seq_length=122; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.3893129771; head_quality=34.6; avg_quality=38.1456953642; grab=BI_0_1; reverse_primer=aagggcaccacaagaacgc; 2:N:0:1

ADD REPLYlink 2.4 years ago
hexphe
• 10
Entering edit mode
0

As I replied to you on GitHub: If you launch the command with " " at the end, it will tell the script to use a space as a separator. Your name format is a bit strange so possibly you'll need to test some other options. Please report if this solves your problem. If not, please contact me by email with sample files (~100 sequences per file) so we can find a solution.

ADD REPLYlink 2.4 years ago
Eric Normandeau
10k
Entering edit mode
1

The new script you sent to me by email works when I use " " as a separator. Thank you very much for your help!

ADD REPLYlink 2.4 years ago
hexphe
• 10
7
Entering edit mode

I have not tried, but something following this line:

awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read1.fq | sort -S 8G -T. > read1.txt
awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read2.fq | sort -S 8G -T. > read2.txt
join read1.txt read2.txt | awk '{print $1"\n"$2"\n+\n"$3 > "r1.fq";print $1"\n"$4"\n+\n"$5 > "r2.fq"}'

Basically, you convert fastq to TAB delimited format, sort them by read names, join sorted files together and then convert back to fastq.

EDIT: strip "/[12]" in read names.

ADD COMMENTlink 8.9 years ago lh3 31k
Entering edit mode
1

"join" discards orphans. If you want to keep them, you can output orphan lines with "join -a". Writing your own script to join is also easy given two sorted files.

ADD REPLYlink 8.9 years ago
lh3
31k
Entering edit mode
0

how does this deal with orphans?

ADD REPLYlink 8.9 years ago
Jeremy Leipzig
18k
Entering edit mode
0

I have a quite similar problem. Currently, I sort the file and reads one line by one line to do pairing, but that is time and memory consuming.

ADD REPLYlink 8.8 years ago
Gangcai
• 230

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0