Question

Inconsistency between htseq-count output sorted by position or by name

1

Entering edit mode

7.7 years ago

lolec ▴ 20

Hello everyone!

I am using htseq-count (version 0.6.1p1) on paired-end RNA seq data. I did the analysis sorting by name and sorting by position, however I obtained different results.

By name:

samtools sort -n accepted_hits.bam -o accepted_nsorted.bam
python -m HTSeq.scripts.count -f bam -r name -s reverse accepted_nsorted.bam GRCh37.75.gtf > htseq_accepted_nsorted.txt

tail htseq_accepted_nsorted.txt
ENSG00000273487 2
ENSG00000273488 2 
ENSG00000273489 3
ENSG00000273492 5
ENSG00000273493 1
__no_feature    1372372
__ambiguous     820117
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  1836341

By position:

samtools sort accepted_hits.bam -o accepted_psorted.bam
python -m HTSeq.scripts.count -f bam -r pos -s reverse accepted_psorted.bam GRCh37.75.gtf > htseq_accepted_psorted.txt

Message:

Warning: Mate pairing was ambiguous for 23096 records; mate key for first such record: ('HWI-1KL149:106:C8CUGACXX:5:2102:5222:87333', 'first', '1', 146410, '1', 146469, 9416).

29201104 SAM alignment pairs processed.

tail htseq_accepted_psorted.txt
ENSG00000273487 2
ENSG00000273488 2
ENSG00000273489 3
ENSG00000273492 5
ENSG00000273493 1
__no_feature    1467721
__ambiguous     866870
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  2028312

What is wrong in this analysis ? How can we explain the different counts obtained ? Is there a solution to this problem ?

RNA-Seq htseq-count • 3.2k views

ADD COMMENT • link 7.7 years ago by lolec ▴ 20

1

Entering edit mode

Looks weird. Probably related to the warning you received.

See htseq-count manual:

-r <order>, --order=<order>¶ For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.

Makes me guess that the names of your read (pairs) are not unique? Just something I would check. How did you obtain the bam?

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

There is no 'warning' when sorted by name. Then I don't thing the problem is coming from reads' name. The bam was obtained with TOPHAT2.

ADD REPLY • link 7.7 years ago by lolec ▴ 20

0

Entering edit mode

I wouldn't expect a warning when sorting by name, since in that case htseq-count expects the both pairs to be adjacent. If sorted by position, htseq-count will search for the other read of the pair (to perform the mate pairing). And this gave a warning as you can see. Maybe it's the best to contact the author of the tool.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Okay, I will do that. Thank you for your help.

ADD REPLY • link 7.7 years ago by lolec ▴ 20

0

Entering edit mode

Did you manage to find any solution? (thanks in advance)

ADD REPLY • link 7.3 years ago by al-ash ▴ 200

0

Entering edit mode

Unfortunately no. It might be inherent to how HTSEQ-count works.

ADD REPLY • link 7.3 years ago by lolec ▴ 20