Hello everyone,
I am using htseq-count (version 0.9.1) on single-end RNA seq data. I did the analysis with the option --nonunique none and --nonunique all, however I obtained different results.
non unique none
htseq-count -f bam -a 0 -s no --nonunique none -m union -r pos -t gene -i Name No_treated_qual_MPOB_sort.bam my_file.bam my_file.gff3 > output_nonunique_none.txt
non unique all
htseq-count -f bam -a 0 -s no --nonunique all -m union -r pos -t gene -i Name No_treated_qual_MPOB_sort.bam my_file.bam my_file.gff3 > output_nonunique_none.txt
In both cases I have 17 344 033 SAM alignment pairs processed.
But the result of __no_feature and __ambiguous is different between these 2 command line. I don't understand why ?
To my opinion when I read this manual, I should obtain the same number of __no_feature and __ambiguous with the 2 command line and I should obtain a different number of count for reads which are mapping against a gene.
Result with non unique none
__no_feature 3198841
__ambiguous 187272
__too_low_aQual 0
__not_aligned 1582688
__alignment_not_unique 1449930
sum_of_reads_map_against_gene 10925302
total 17344033
Result with non unique all
__no_feature 4114357
__ambiguous 203260
__too_low_aQual 0
__not_aligned 1582688
__alignment_not_unique 1449930
sum_of_reads_map_against_gene 11851412
total 19201647
So if someone can explain me why do I obtain more no feature reads and ambiguous reads in the case of the --nonunique all ?
I don't undestand in the manual, it says : __alignment_not_unique : reads (or read pairs) with more than one reported alignment.
That means reads can map on 2 features or more not 0 (no_feature) and 1.
Let me clarify:
A read can align to multiple sections of the genome, generating multiple alignments. When you include those reads in the analysis, all their possible alignments will be compared to the gene annotation and an alignment can correspond to 0 (no feature), 1(map against gene) or more(ambiguous) features.
To summarize:
Usually, multimappers (reads with multiple alignments) are omitted from the analysis because you don't know from which section of the genome it comes from.
Ok, now it's clear, thank you for your answer