Downloading SRA Reads
7.5 years ago

I am using the SRA Toolkit to download SRA reads from the NCBI web page. I went through the manual and came up with this command:

 ~/bin/sratoolkit.2.7.0-centos_linux64/bin/fastq-dump --split-files --fasta 60 SRR1981235

I expected it to download the SRA file and split it into two files. It created two FASTA files, fasta.1 and fasta.2 (paired reads), but the aligner says the numbers of reads are not equal. I am very new to this and I need advice. Regards!

RNA-Seq next-gen • 3.0k views

Try --split-3 rather than --split-files.

1) What program are you using to do the alignment? TopHat?

2) Did you do any trimming or QC between downloading the fastq files and alignment?

You may not have downloaded the files completely. Can you run:

 wc -l fasta.1

to give you the line count?

Hi,

I had this issue as well. It happens all the time with the SRA Toolkit when downloading SRA files and converting them to FASTQ. Sometimes you need to download the SRA files all over again. That should work.

7.5 years ago
BioinfGuru ★ 1.7k

I've just run your command, downloaded the 2 files, and followed it with a line count (wc -l), with these results:

41319291 SRR1981235_1.fasta

41319291 SRR1981235_2.fasta

So the lines are all paired. I think you've had an interrupted download, so try it again. Both downloaded files should be 2.4G each.

Then do wc -l to make sure you get the same result as me. All should be good then.
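Because --fasta 60 wraps each sequence at 60 columns, wc -l counts wrapped lines rather than reads. A minimal sketch of a more direct check is to count the header lines (those starting with ">") in each mate file and compare the two numbers. The function names count_fasta_records and check_pair are made up for this illustration.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Print the read count of each mate file and succeed only if they are equal.
check_pair() {
    n1=$(count_fasta_records "$1")
    n2=$(count_fasta_records "$2")
    echo "$1: $n1 reads"
    echo "$2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

Usage, with the file names from this thread: check_pair SRR1981235_1.fasta SRR1981235_2.fasta && echo "pairs match"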

41319291 SRR1981235_1.fasta

If these FASTQ files were generated with fastq-dump, the number of lines has to be a multiple of 4. An uneven number of lines indicates a problem.
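The multiple-of-4 rule holds for unwrapped four-line FASTQ records and can be tested with a sketch like the one below. Note that files written with --fasta 60, as in the original command, are FASTA rather than FASTQ, so the rule would not apply to them directly. The function name fastq_lines_ok is hypothetical.

```shell
# Succeed only if the file's line count is a multiple of 4,
# as expected for unwrapped four-line FASTQ records.
fastq_lines_ok() {
    lines=$(wc -l < "$1")
    [ $((lines % 4)) -eq 0 ]
}
```

Usage: fastq_lines_ok reads_1.fastq || echo "line count is not a multiple of 4" (reads_1.fastq is a placeholder file name).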

The SRR1981235 reads were obtained from Helicobacter pylori, which has a small chromosome of only about 1,600,000 nucleotides. SRR1981235 has an absurd coverage of more than 1000-fold.

There are more than 500 read sets available for Helicobacter pylori in the Sequence Read Archive. Just take another one.

https://www.ncbi.nlm.nih.gov/sra/?term=txid210[Organism:exp]

I remember reading that somewhere too lol

For the SRA run SRR1981108, I am getting files of unequal size even after using the --split-3 option mentioned above. Could you please guide me through this?

Actually, you can just download the FASTQ files from DDBJ. Then you can skip the annoyance of using SRA.

Oh, that's a handy link, thank you! However, is the SRA Toolkit pointing out problems in the SRR1981108 data set that should not be ignored just by bypassing the SRA Toolkit? If so, would these be spotted anyway with FastQC?

Instructions:

Search tool: https://trace.ddbj.nig.ac.jp/DRASearch/

Enter Accession ID: SRR1981108

Click link under "submission" (SRA259635)

Press Ctrl+F and search for "SRR1981108"

Click link "FASTQ"

Download SRR1981108_1 + SRR1981108_2 (these are compressed .bz2 files)

Extraction options (I've not tried this step):

1) Linux terminal: bzip2 -dk filename.bz2 (I don't think tar will work)

2) Windows: 7-Zip, WinRAR, or any other archiving software
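For the Linux option, a cautious sketch is to test the archive before decompressing it, since a truncated download would otherwise only show up later as mismatched read counts. The function name extract_bz2 is made up here, and the exact names of the downloaded files are an assumption.

```shell
# Verify a .bz2 archive, then decompress it while keeping the original (-k).
extract_bz2() {
    bzip2 -t "$1" || return 1   # integrity test; fails on a truncated or corrupt file
    bzip2 -dk "$1"              # writes the same file name without the .bz2 suffix
}
```

Usage: extract_bz2 SRR1981108_1.fastq.bz2 (assumed file name; use whatever name the download actually has).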

Thank you for the advice. I will try this one out.

Hi, can you please help me with this: on DDBJ, how did you find the number SRA259?

Thanks

I don't think such an accession number exists. Anyway, use the search page linked to by kennethcondon2007 above.

Then whoever uploaded it uploaded something broken. The only thing you can do is resync the files with reformat.sh from BBMap (or an equivalent).
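Before resyncing, it can help to see which reads actually lack a mate. The sketch below is not BBMap; it only reports read IDs that occur in one FASTA mate file but not the other, assuming the first whitespace-separated token of each header is the read ID and is identical for both mates. The function names fasta_ids and unpaired_ids are hypothetical.

```shell
# Print the sorted read IDs (first header token, without '>') of a FASTA file.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort
}

# List IDs that appear in only one of the two mate files.
# Unindented lines: only in the first file; tab-indented lines: only in the second.
unpaired_ids() {
    a=$(mktemp) && b=$(mktemp) || return 1
    fasta_ids "$1" > "$a"
    fasta_ids "$2" > "$b"
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Usage: unpaired_ids SRR1981108_1.fasta SRR1981108_2.fasta prints nothing when every read has a mate.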

The output for SRR1981108 is problematic:

kwc17@Bioinftop2016-05 ~/Desktop/new $ fastq-dump --split-files --fasta 60 SRR1981108

2016-10-04T20:27:39 fastq-dump.2.6.3 warn: too many reads 33 at spot id 9935, maximum 32 supported, skipped

2016-10-04T20:27:39 fastq-dump.2.6.3 warn: too many reads 43 at spot id 19767, maximum 32 supported, skipped

2016-10-04T20:27:52 fastq-dump.2.6.3 warn: too many reads 77 at spot id 37494, maximum 32 supported, skipped

2016-10-04T20:27:56 fastq-dump.2.6.3 warn: too many reads 65 at spot id 43137, maximum 32 supported, skipped

2016-10-04T20:28:23 fastq-dump.2.6.3 warn: too many reads 69 at spot id 88649, maximum 32 supported, skipped

2016-10-04T20:28:24 fastq-dump.2.6.3 warn: too many reads 41 at spot id 91070, maximum 32 supported, skipped

2016-10-04T20:28:35 fastq-dump.2.6.3 warn: too many reads 117 at spot id 104330, maximum 32 supported, skipped

2016-10-04T20:28:37 fastq-dump.2.6.3 warn: too many reads 37 at spot id 109128, maximum 32 supported, skipped

2016-10-04T20:28:49 fastq-dump.2.6.3 warn: too many reads 47 at spot id 125262, maximum 32 supported, skipped

2016-10-04T20:28:51 fastq-dump.2.6.3 warn: too many reads 99 at spot id 129723, maximum 32 supported, skipped

2016-10-04T20:28:54 fastq-dump.2.6.3 warn: too many reads 41 at spot id 132767, maximum 32 supported, skipped

2016-10-04T20:29:01 fastq-dump.2.6.3 warn: too many reads 55 at spot id 138644, maximum 32 supported, skipped

Rejected 12 SPOTS because of to many READS
Read 163482 spots for SRR1981108
Written 163467 spots for SRR1981108

kwc17@Bioinftop2016-05 ~/Desktop/new $ ls
SRR1981108_1.fasta   SRR1981108_19.fasta  SRR1981108_3.fasta
SRR1981108_10.fasta  SRR1981108_2.fasta   SRR1981108_4.fasta
SRR1981108_11.fasta  SRR1981108_20.fasta  SRR1981108_5.fasta
SRR1981108_12.fasta  SRR1981108_21.fasta  SRR1981108_6.fasta
SRR1981108_13.fasta  SRR1981108_22.fasta  SRR1981108_7.fasta
SRR1981108_14.fasta  SRR1981108_23.fasta  SRR1981108_8.fasta
SRR1981108_15.fasta  SRR1981108_24.fasta  SRR1981108_9.fasta
SRR1981108_16.fasta  SRR1981108_25.fasta  SRR1981235_1.fasta
SRR1981108_17.fasta  SRR1981108_26.fasta  SRR1981235_2.fasta
SRR1981108_18.fasta  SRR1981108_27.fasta

You get 27 tiny output files instead of 2 (compare the original SRR1981235_1.fasta + SRR1981235_2.fasta).

I don't know how to handle this problem, apart from removing this SRR from the data set of course. Does it have to be included?

7.5 years ago

I have a question. When I used --split-3 rather than --split-files, I got three output files. The fastq-dump manual says that reads satisfying the biological condition are written to _1.fasta and _2.fasta, and that if only one read is present it is written to a single .fasta file. I do not understand what fastq-dump considers a biological condition. Can anyone help me with this, please?

Sometimes people include orphaned reads in their submissions; just delete the third file.

I'd be willing to bet that the quality of the reads in the first 2 files is poor. Don't forget to QC your FASTQ files.
