I downloaded toy.sam from https://raw.githubusercontent.com/samtools/samtools/develop/examples/toy.sam to check how bedtools calculate the end position. Here is what I have done:
I converted the toy.sam to bam file with samtools and then to bed file with bedtools (toy.bed). Then I used R (with tidyverse) to calculate a comparison table with the following scripts:
sam <- read_delim("toy.sam", delim = "\t", col_names = F, skip = 2)
bed <- read_delim("toy.bed", delim = "\t", col_names = F)
sam <- sam %>% mutate(seq = X10, sam_read_len = str_length(seq))
bed <- bed %>% mutate(difference_in_positions = X3 -X2)
compare_tbl <- bind_cols(sam, bed) %>% select(X1, sam_read_len, difference_in_positions)
Here is the compare table I have obtained
X1 sam_read_len difference_in_positions
r001 19 16
r002 17 10
r003 6 6
r004 12 25
r003 5 5
r001 9 9
x1 20 20
x2 21 21
x3 26 22
x4 25 25
x5 24 24
x6 23 23
Could someone explain why the end position - start position does not always equal to the length of the read sequence from sam file, please? I have always been thinking that bamtobed calculates end position by using sam's start position - 1 + read length (since sam is 1 based and bed is 0 based). This is the case for the last few alignments but not the first few.