Question

How To Convert SRA-Lite Paired-End Submission To FASTQ?

12.7 years ago

Casey Bergman 18k

I'm having some trouble converting an Illumina paired-end accession from NCBI's SRA to the paired _1 and _2 fastq files using fastq-dump from the SRA Toolkit. I'm running fastq-dump version 2.1.0 (June 22, 2011) and following instructions from the NCBI website here.

When I download this (or other accessions from the same project) and convert to fastq, one or the other of the _1 and _2 fastq files has 2x as many sequences, with all of the sequences from the smaller file being included in the larger file, e.g.

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump -A SRR189044 ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG

Notice how the _1 file has two reads named SRR189044.1, one of which is the corresponding read in the _2 file.
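
For anyone wanting to confirm the symptom on their own files, here is a minimal shell sketch. It is not part of the SRA Toolkit, and the function names `count_reads` and `count_duplicate_names` are made up for illustration. It assumes plain four-line FASTQ records, as in the output above, and reports how many records a file holds and how many read names occur more than once.

```shell
# Count FASTQ records, assuming exactly 4 lines per record
# (no wrapped sequence or quality lines).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Count distinct read names that appear more than once.
# The name is taken as the first whitespace-delimited field
# of each header line (e.g. @SRR189044.1).
count_duplicate_names() {
    awk 'NR % 4 == 1 { print $1 }' "$1" | sort | uniq -d | wc -l | tr -d ' '
}
```

Calling `count_reads` on SRR189044_1.fastq and on SRR189044_2.fastq should expose the 2x difference described above, and a non-zero result from `count_duplicate_names` flags a file that contains repeated read names.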

I've checked with the data submitters and NCBI, and it looks like there is no duplication of data in the original submission. There is a related post on SeqAnswers that unfortunately does not address or help solve this issue. Any ideas on what might be going on here would be appreciated.

Many thanks, Casey

sra paired fastq conversion • 27k views

Adding this self-Q&A to help others with the same problem.

written 12.7 years ago by Casey Bergman 18k

score 24 · Answer 1 · 2011-08-11

The problem is that your version of the SRA Toolkit is out of date: newer versions of fastq-dump have an un(der)documented option for dumping paired-end data from an SRA-lite submission. The guidance notes on the NCBI website you refer to were written for version 2.0.1 and recommend staying current with Toolkit updates:

This guide is current to SRA Toolkit version 2.0.1 release candidate 1. Instructions for previous versions of the SRA Toolkit may be different from those provided in this guide. We recommend that users stay current with SRA Toolkit updates to benefit from feature additions and bug fixes.

In the latest version of the SRA Toolkit, 2.1.2 (July 26, 2011), there are now options to split paired-end reads into separate files:

 --split-files                    Dump each read into a separate file.Files will received suffix corresponding to read number
 --split-3                        Legacy 3-file splitting for mate-pairs:
                                  First 2 biological reads satisfying dumping conditions
                                  are placed in files *_1.fastq and *_2.fastq
                                  If only 1 biological read is dumpable - it is placed in *.fastq

The explanation of the "--split-files" option says that each read will be dumped into a separate file, which is ambiguous and could be taken to mean that every single read gets its own file. It actually means that the two reads of each pair are written to the _1 and _2 fastq files respectively, which is the desired outcome.

For your example, you should upgrade to SRA Toolkit v2.1.2 and run the following commands:

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump --split-files ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
#4767;;7<:>?@@##############################################################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
AACAGATTGTATATGTGTTTTTTTTACATGGCTCATTGGCAAATGTTTTTGNNNNATCGAAATCTTTCTCGTATAC

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG
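
A quick way to check that the two files really came out paired is to compare read names record by record. This is only a sketch with hypothetical helper names (`read_names`, `check_paired`). It assumes four-line FASTQ records and that both mates carry the same name in the first header field, as in the output above.

```shell
# Print the read name (first field of each header line) for every record.
# Assumes exactly 4 lines per FASTQ record.
read_names() {
    awk 'NR % 4 == 1 { print $1 }' "$1"
}

# Return 0 only if both files list the same read names in the same order,
# which also implies they hold the same number of records.
check_paired() {
    names1=$(mktemp) && names2=$(mktemp) || return 2
    read_names "$1" > "$names1"
    read_names "$2" > "$names2"
    cmp -s "$names1" "$names2"
    status=$?
    rm -f "$names1" "$names2"
    return "$status"
}
```

For example, `check_paired SRR189044_1.fastq SRR189044_2.fastq && echo paired` prints "paired" only when the names line up. On files like the ones in the original question, the check would fail because the _1 file contains extra records.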

Hope this helps!

score 0 · Answer 2 · 2015-08-13

8.7 years ago

Maximilian Haeussler ★ 1.6k

I had the same problem. This help message is prone to misunderstanding: "Dump each read into a separate file".