BWA gives wrong number of input sequences
1
0
Entering edit mode
5.7 years ago
agata88 ▴ 870

Hi all!

I was trying to map sequences to genome contigs. To do that I've performed commands:

bwa index ref.fasta
bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools flagstat aln_sorted.bam > aln_stats.txt

input. fasta include 3158 sequences.

Here is aln_stats.txt output:

 5284 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    2126 + 0 supplementary
    0 + 0 duplicates
    4890 + 0 mapped (92.54% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

Why output includs 5284 sequences in total?

I see that 2126 is supplementary, 5284 - 2126 = 3158, which is the number of sequences in input.

Thanks in advance, Best, Agata

bwa • 2.1k views
ADD COMMENT
2
Entering edit mode

bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam

Note that you could simplify this, speed things up and avoid intermediate files:

bwa mem ref.fasta input.fasta | samtools sort -o aln_sorted.bam -

(requiring that your samtools is not too old, but you should use up-to-date software anyway)

ADD REPLY
1
Entering edit mode

Hi Agata, these are most likely sequences mapping to more than one location. The BAM doesn't tell you the number of input sequences but the number of alignments against the reference

ADD REPLY
0
Entering edit mode

Hey, thanks! Do you know how can I find which sequences mapped to different contigs? Best, Agata

ADD REPLY
0
Entering edit mode

Search Biostars for "identify multi-mapped reads". Use an external google search.

ADD REPLY
0
Entering edit mode
5.7 years ago

samtools flagstat is not a very smart program. It's merely counting lines in your bam. If a read is in there twice, because it aligned to multiple places and all are included, samtools flagstat will count it twice. If you omit reads with the 256 binary flag, that will get rid of the secondary alignments.

ADD COMMENT

Login before adding your answer.

Traffic: 3001 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6