Question

Difference in htseq-count values for sorted bam and unsorted bam by name

1

Entering edit mode

8.5 years ago

nalandaatmi ▴ 100

Dear All,

I am working on getting the read counts for some genes. Firstly, I aligned my trimmed RNASeq reads with human and mouse genome to get alignment file (accepted_hits.bam) in bam format using tophat. I provided htseq-count tool with two different bam files (sorted and unsorted).

Case1: I used the accepted_hits.bam from tophat directly as the input to the htseq-count and counted the reads for each gene.

$ htseq-count -f bam -s yes -i gene_id accepted_hits.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt

Note: By default accepted_hits.bam file is sorted based on coordinates by tophat.

Case2: I sorted the accepted_hits.bam file based on "NAMES" using samtools. Then provided as input to the htseq-count and counted the reads for each gene.

$ samtools sort -n accepted_hits.bam accepted_hits_sorted.bam
$ htseq-count -f bam -r name -s yes -i gene_name accepted_hits_sorted.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt

                           Unsorted-Bam      Sorted-Bam
Globin Genes               Stranded: YES     Stranded: YES
HBB                        36472             19800
HBA1                       20398             4249
HBA2                       144174            47553
HBG1                       2825              2384
HBG2                       2638              1942
HBD                        4                 3
HBE1                       0                 0
HBZ                        0                 0
HBQ1                       3                 2
MB                         0                 0
CYGB                       2                 1
NGB                        2                 1
Total Globin               206,518           75,935
__no_feature               94,038,451        51,420,979
__ambiguous                46,175            56,695
__too_low_aQual            0                 0
__not_aligned              0                 0
__alignment_not_unique     32,018,089        16,952,589
Total reads                126,969,861       68,900,778
% of Globin_reads          0.1626            0.1102

Can anyone explain why the total reads value is low for sorted bam compared to unsorted bam?

RNA-Seq htseq-count sequencing SNP • 9.5k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.5 years ago by nalandaatmi ▴ 100

Ram · Answer 1 · 2015-10-15

4

Entering edit mode

8.5 years ago

Devon Ryan 104k

The name sorted results are correct, the coordinate ones are not because you lied to htseq-count and told it you had a name-sorted file (it's the default, you don't need to say -r name) instead of the coordinate-sorted file that you do have (accepted_hits.bam is coordinate sorted by default). Try adding -r pos with accepted_hits.bam and you'll likely get more or less identical results.

ADD COMMENT • link 8.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Devon for your answers. I submitted the jobs for htseq-count with -r pos parameters. Currently, the jobs are running. I have another query,

Why the reads count 68,900,778 from htseq-count is not matching to either the trimmed reads count (66,113,117 trimmed reads pairs (R1 and R2))

or

the aligned reads count (Unique reads: 56,594,380 or 131,374,951) from bam file?

$samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
Unique Reads: 56,594,380
$samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by nalandaatmi ▴ 100

0

Entering edit mode

Tophat may output multiple alignments for a given read/pair. htseq-count won't include them in any real counts (they'll be in "__alignment_not_unique"). You don't want to count reads, but fragments, so there are ~57 million of them.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k

0

Entering edit mode

Yeah that make sense, there are ~57 million unique fragments. So in that case, a fragment can be single read or combination of multiple reads.

The below details are from this post.

2) After aligning the trimmed reads using tophat

TopHat Output align summary:

Left reads: (Left reads alone aligned)
          Input     :  66,113,117
           Mapped   :  54,521,571 (82.5% of input)
            of these:   4,509,697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads: (Right reads alone aligned)
          Input     :  66,113,117
           Mapped   :  53,610,826 (81.1% of input)
            of these:   4,412,079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate. (is average of 82.5% and 81.1%)

Aligned pairs:  51,538,017 (Both left and right reads as a pair aligned)
     of these:   4,275,585 ( 8.3%) have multiple alignments
                  780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate. 

Left reads Input - Mapped reads = 11,591,546 reads (i.e 17.5% is not mapped from left)

Right reads Input - Mapped reads = 12,502,291 reads (18.9% is not mapped from right)

TopHat Output Prep_reads_info:

left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886
right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354

left_reads_in - left_reads_out = 33,231 reads (what happened to these reads?)

right_reads_in - right_reads_out= 42,763 (what happened to these reads?)

Both left reads input and Left_read_in have 66,113,117 sample value,

Questions:

What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?
What is the difference between right_reads_out (66,070,354) and right reads (53,610,826)?
What is the difference between 81.8% overall read mapping rate and 76.8% concordant pair alignment rate?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by nalandaatmi ▴ 100

0

Entering edit mode

left_reads_out/right_reads_out are what remain after tophat filters for length. You can google "concordant alignment" for the rest of your question.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k