Question

Samtools view: Sequence of reads given by samtools view differs from read sequence in raw sequence files

1

Entering edit mode

7.8 years ago

daniel.cole.w ▴ 10

Hi,

I am using samtools view to get reads covering a specific position. Howerver, these reads do not match the reads with the same id in the raw sequence files.

For example, say I run samtools view <file>.bam <position> I get a read with id x and sequence AAAA... Then, looking for reads with id x in the raw sequence files I get a read with sequence GCTTA...

Is samtools view reliable for a given position? Is it possible that my read ids got mixed up during mapping?

Thanks,

Daniel

samtools software error stampy • 2.8k views

ADD COMMENT • link updated 7.8 years ago by Brian Bushnell 20k • written 7.8 years ago by daniel.cole.w ▴ 10

score 2 · Answer 1 · 2016-06-21

2

Entering edit mode

7.8 years ago

Brian Bushnell 20k

Unfortunately, the sam/bam specification requires sequences to be reverse-complemented when the primary alignment is to the minus strand of the reference. This causes endless confusion for questionable benefit in rare cases. At any rate, I suggest that you test whether the reverse-complement looks like the original read.

For similarly questionable reasons, the sam format requires that the names of read 1 and read 2 in pairs match. So if you originally had a pair named "100:255:7 1:ACCTGT" and "100:255:7 2:ACCTGT", generating a sam file necessitates the lossy transformation of giving them both the same name, which could mean truncating the names at the first space so they are both named "100:255:7", or just using read 1's name so they are both named "100:255:7 1:ACCTGT". So it's also important to check if read 2, or maybe read 2's reverse-complement, matches what you expect.

I have found some versions of samtools to be very reliable, once you accept these deficiencies in the format specification.

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

1

Entering edit mode

There are indeed downsides with regard to reverse complement, but there are multiple benefits you have overlooked. 1) better compression. 2) Easier for programmers. If BAM kept reads in the original strand, every pileup, alignment viewer, variant caller, NM/MD recomputer and quality recalibrator would have to reverse complement half of reads. 3) Making the sequence consistent with CIGAR on the reference strand. Note that for variant calling, you can hardly use CIGAR on the read strand; that way you would need to left align gaps for forward reads and right align gaps for reverse reads, which is hugely confusing. 4) Easier for eyeballing.

There was a long discussion on how to keep paired-end reads. We evaluated alternatives like requiring /[12] in read names, adding a new fixed column, using tags, etc. The consensus is the current SAM way. I have concerns myself, but I see that is the best we could come up with. If you have better ideas, let me know. Also on the fasta/fastq format, it is wrong to say "truncating the names at the first space". Everything after the first space is not part of the read name. You never truncate names at spaces.

ADD REPLY • link 7.8 years ago by lh3 33k

0

Entering edit mode

Thanks for your reply! So with my Ids truncated at the space, there is no way to tell which read came from my R1 or R2 files? Except for checking the sequence?

ADD REPLY • link 7.8 years ago by daniel.cole.w ▴ 10

0

Entering edit mode

No, you can find that out by decoding the bitflag (second field). This is a decimal integer. If you translate it into binary, then if the 8th bit is set, it was read 2; otherwise, it was read 1.

You can do this in Windows 7 (and presumably other versions) with the calculator. Got to view -> programmer. Enter the number (say, 231), then switch from decimal to binary, and it will display the number in binary (11100111). In this case, the 8th bit is set, meaning it came from read 2.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k