Samtools To Analyze Bwa Output
1
0
Entering edit mode
12.1 years ago
Dawnoflife ▴ 10

I have a query file with 70,000 lines of sequences and when I do the bwasw each sam output file is about 35 mb and obviously contains all the sequence lines. I need to perform the alignment on about 1000 files, so you can imagine that I don't want to have 35 gigs worth of output to sort. (using cygwin on Windows 7 btw)

Exact bowtie command used:

./bwa bwasw INDEXES/AC_000091.fna QUERY/635plus104_500bp_reads.fasta > results.sam

My input is the index created with

./bwa index AC_000091.fna "AC_000091.fna";

My query file is a FASTA file. The sequences are 500bp, i am only pasting some of them.

> 1
CATGACTGATTCACGCCGTTCGGGGTTATTAAACCAAACTCGCCCTGCAGGTGTGGCAACATATCCA
> 2
ACGGCCGCAGCCATAATGGTCCAATCAGCTTGAGGATATGCGGCTAACATTGCAGCACGCATTTCTG
> 3
GATGCGGGTAACGGCATCCAAAAATCAAACCGGGCTGAATAGCTTCGTAAGCCATATGCTGAGCAGT
> 4
CGTACCATCGGCATTCAATTGGGCACCGTACAGCATGCCTTTCTCCAGGAGTTTGGAGTTGGCTTTG
> 5
ATCTCCGGTGGGTTCAACCCCTACTTTTGCGAGTAAGGCGTCCTGGCCATGATCGGCAAGGGCCATA
> 6
GTAGAAAAGTCTGCTCCAGAGACAGTCCGTGAGCAAGATGCATCAATTGAAGCCTCGGATGAGGCCGA
> 7
CAGTCGGTGGGTGGCTGAGAGCTTAAGGATGGCTTTGGTTTGGGCGGCTTTGGGATTTTTGATATTTT
> 8
GCCCAAAGTGACTGAACTAAAGACAATCGGAAACAGGTTAGCGGCTAACACAAACCCCAGTATGGACA
> 9
TCGGGATTGGGCATCAAAAGGGATTTCAACTGTGCCTGCTAATCCCACCAAATCATCATGGCTAATTA
> 10
AGCGCTATCCTCCTTATTCATACAATAGTGATTGTTCTGTTGTCTCCACTGCTGCTCATATTCACAGG
> 11
TAGGGAACATGCCCCCCAAGCGATTTCAAAATGGGGGATTCAGGCGATTATTGGCGAGAGTTTTGCCG

I am trying to get rid of the non-aligned reads from the SAM output file ans I searched through forums and tried the following code with no avail.

./samtools view -bS -f 4 testing.sam > output.bam

All my sam files give the following error:

`[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '0' is recognized as '*'.
Parse warning at line 3: mapped sequence without CIGAR
Parse error at line 3: missing colon in auxiliary data
Aborted (core dumped)
`

Here's what a part of my sam files looks like

@SQ    SN:gi|89106884|ref|AC_000091.1|    LN:4646332

    4    *    0    0    *    *    0    0    <CGTAAAAAGGA>     *
    4    *    0    0    *    *    0    0    <TAGCCGTAGGG>   *

I'd appreciate help with this!

samtools bwa output • 8.2k views
ADD COMMENT
0
Entering edit mode

That sam output looks odd. How exactly did you generate it?

ADD REPLY
0
Entering edit mode

I agree the output looks odd, there are no sequence names f.e. Could there be a problem with the input file, lacking proper names etc.? Could you please post one (upload to public ftp/web) of your input files (or at least the first 100 sequence files) + the exact command line of bwa, I assume it's fastq input (if it was fasta you wouldn't have the non matching output with bwasw)?

ADD REPLY
0
Entering edit mode

Added the code i used for bwa and my query file! Thank you for the help!

ADD REPLY
0
Entering edit mode

I will try that out tomorrow. Your reference sequence is AC_000091.1 in RefSeq (E. coli K12 substr. W3110)? It is marked obsolete and removed in NCBI nucl., not that it matters here, but just wanted to note.

ADD REPLY
0
Entering edit mode

Ha yes I noticed that too. My data is just random, i want to check the efficiency of bwa and bowtie etc.

ADD REPLY
3
Entering edit mode
12.1 years ago
Michael 54k

I have just tried it and found the reason for the error. First, with your input you are providing I see exactly the same error.

it's the id line in your query file:

> 1
ATGC...
> 2
ATGCG

That doesn't seem to work with a space between the > and the identifier, so simply remove the space.

Try running your input file through the following awk command (I hope cygwin includes awk):

 awk '{sub(/> /,">");print}' query.fna > query2.fna

Then running the following commands on query2.fna I get:

Confucius:test mdondrup$ bwa bwasw AC_000091.fna query2.fna > results.sam
[bsw2_aln] read 11 sequences/pairs (743 bp)...
[main] Version: 0.6.1-r104
[main] CMD: bwa bwasw AC_000091.fna query2.fna
[main] Real time: 0.010 sec; CPU: 0.012 sec
Confucius:test mdondrup$ samtools view -bS -f 4 results.sam > output.bam
[samopen] SAM header is present: 1 sequences.
ADD COMMENT
0
Entering edit mode

Will you delete my post as it has been wrongly accepted as the answer? Cheers.

ADD REPLY
0
Entering edit mode

Ah nope, I think you could do that yourself if you wanted, but just leave it, it documents try-and-error strategy in debugging. Further, it's up to the OP to decide on the accepted answer, even if contradicting evidence, that's how the system plays out.

ADD REPLY
0
Entering edit mode

This works really well! Thank you!!! Is there a way I can save just the aligned sequences in the output file? I can't open bam with notepad.

ADD REPLY

Login before adding your answer.

Traffic: 1999 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6