Question

Sra Format Questions: How To Get A Sam File That Can Be Converted To Bam?

5

Entering edit mode

10.2 years ago

bede.portz ▴ 540

I'm trying to obtain some published chip-seq data from another lab that is stored in the SRA. I have downloaded and installed the SRA toolkit. I am having some problems obtaining a SAM file, that I can convert to BAM, and ultimately, BED. I was hoping Biostars could clarify some things, I found the SRA toolkit documentation to be difficult to understand.

First of all, do all SRA files contain information that can be ouput as a sam file using sam-dump? Is there a standard set of data types that are included in an SRA file, or can it vary from dataset to dataset? If so, how does one know what type of data is included in a given SRA file?

I may well be misunderstanding many things, but I will include the steps I have attempted that ultimately result in a .bam file that cannot be converted to a .bed file. Any help in learning to access SRA files and transfer them to sam and then bam would be much appreciated.

First I try to obtain a sam file using sam-dump

$sam-dump SRR490207.sra > SRR490207.sam

Here is the first few lines of the output:

 HWUSI-EAS1694_0008:5:1:1061:20521    4    *    0    0    *    *    0    0    TGTGATCTGACCTTACCAATCTTTNCNNNNNNNNNN    DFFFBFDDFFFFFFFFFFFB@=B@############
HWUSI-EAS1694_0008:5:1:1061:19640    4    *    0    0    *    *    0    0    AATTCTACAACATCTCCAACAAATNTNNNNNNNNNN    DDDFFFFFFFFFEFEFFFFFCDC#############
 HWUSI-EAS1694_0008:5:1:1061:6084    4    *    0    0    *    *    0    0    TATCCCAAGCTACTCCGGGCCTGCNTNNNNNNNNNN    GGGGGEFFBFDGGFGFF?EECAAB############

Next I try to convert the .sam file to a .bam file using samtools, using the code:

$samtools view -S -b SRR490207.sam > SRR490207.bam

I get the following message:

[samopen] no @SQ lines in the header.

I do get a .bam file, which I try to sort using samtools, but again I get an error regarding the header:

samtools sort SRR490207.bam sortedSRR490207 
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_sort_core] truncated file. Continue anyway.
[bam_sort_core] merging from 3 files...

Finally, When I attempt to convert the .bam file to a bed file, using bamToBed I get an empty .bed file.

Can anyone help me take an SRA file to a .bam file?

Thanks

sam sra chip-seq samtools • 16k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 10.2 years ago by bede.portz ▴ 540

Ram · Answer 1 · 2014-02-09

The first thing you want to understand is that not every SRA entry has aligned reads. Since you want a BED file, which contains features, you will be interested in the aligned data. There is not a great way to check on this. First way is to use the SRA trace browser and look at the summary page (see the page for the accession in your example). There should be an "alignment" tab if there are aligned data in the SRA accession. In your example there is not.

Just for completeness I'll continue though. The second way to check for aligned data is to run the vdb-dump tool on your .sra archive and look for ALIGNMENT_COUNT in the resulting output.

There are no @SQ lines in the header because the @SQ lines are the "sequence library" which is a representation of the contigs in the reference that the reads were aligned to. Since your reads are not aligned this information is not present.

Finally your .bed file is empty because there are no alignments to extract features from.

As a possible solution you might try the fastq-dump tool to extract the reads in FASTQ format, and since the data are annotated as human ChIP-seq data you can use either bowtie2 or bwa to align the reads to a human reference genome.

Ram · Answer 2 · 2014-02-09

1

Entering edit mode

10.2 years ago

Gabriel R. ★ 2.9k

wait... that has nothing to do with the SRA, the convertion to bam has had a problem. Can you do the following for me post the output of:

cat SRR490207.sam | wc -l

and

samtools view SRR490207.bam | wc -l

They should be the same, I suspect that they are not...

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 10.2 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

$ wc -l SRR490207.sam
29915343 SRR490207.sam

$ samtools view SRR490207.bam | wc -l
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
8897161

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 10.2 years ago by bede.portz ▴ 540

0

Entering edit mode

well there you go, something went awry during the sam->bam conversion. reconvert to bam and make sure both files tally to the same number. good luck!

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 10.2 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

There is definitely an issue with the bam file being incomplete, but I believe that even the complete bam file will be insufficient for the OP's analysis since it is all unmapped reads.

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 10.2 years ago by Matt Shirley 10k

score 1 · Answer 3 · 2017-10-11

Since someone appears to have touched this thread today (up-votes) I will add that a new blast program magicblast available from NCBI can directly accept SRA accession ID's and generate a SAM file (that can be converted BAM following usual means).

Edit: Included here for information. Make your own decision about using it.

score 0 · Answer 4 · 2014-04-16

0

Entering edit mode

10.0 years ago

Ming Tommy Tang ★ 3.9k

those sra files are not aligned. use fastq-dump first to get fastq file and then map with bowtie or tophat(if it is RNA-seq)

ADD COMMENT • link 10.0 years ago by Ming Tommy Tang ★ 3.9k