Htseq-Count Error With Strand-Specific Data
2
2
Entering edit mode
11.1 years ago

Hi,

I have 2x50 bp human strand-specific data. I aligned them against hg19 with STAR and after that I extracted the read count per gene with htseq-count. I used refseq genes from UCSC.

But the results are quite strange... Here is a snapshot of my data. The reads shown in blue are aligning to the sense strande of the reference. SERF2 is on the sense strand and MIR1282 is on the anti-sense strand. htseq-count output:

SERF2   0
MIR1282 783

and I used :

htseq-count -s yes -i gene_id - $gtf_refseq > htseq_refseq.txt

I don't understand where my error is.

Has anyone tried htseq-count with STAR?

enter image description here

Here are the gtf lines of interest :

chr15    hg19_refGene    exon    44085857    44085957    0.000000    -    .    gene_id "NR_031695"; transcript_id "NR_031695"; 
chr15    hg19_refGene    start_codon    44084575    44084577    0.000000    +    .    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    CDS    44084575    44084581    0.000000    +    0    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    exon    44084040    44084581    0.000000    +    .    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    CDS    44085173    44085281    0.000000    +    2    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    exon    44085173    44085281    0.000000    +    .    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    CDS    44085908    44085968    0.000000    +    1    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    stop_codon    44085969    44085971    0.000000    +    .    gene_id "NM_001018108"; transcript_id "NM_001018108"; 
chr15    hg19_refGene    exon    44085908    44088287    0.000000    +    .    gene_id "NM_001018108"; transcript_id "NM_001018108";

Another strange thing. When it's an isolated gene (not overlapped by an other gene) like this one in the picture below, there are only 8 reads reported by htseq-count while there are a lot more (~40 in average per position) shown in IGV.

enter image description here

thanks

N.

• 7.2k views
ADD COMMENT
2
Entering edit mode
11.1 years ago

It is possible those are being counted as ambiguous or alignment_not_unique by htseq-count. If you look at the end of your htseq-count results do you see something like this?

no_feature 66768341

ambiguous 3950127

too_low_aQual 0

not_aligned 0

alignment_not_unique 7214789

Those are the last five lines from an htseq-count library that I was just looking at (first time using htseq-count). As you can see, I was getting a lot of reads in those categories. Maybe your data is not being recognized as stranded and therefore MIR1282 has no unique reads. However, even if that explains why you get 0 counts for MIR1282 I'm not sure why you aren't getting anything for SERF2. Is it possible that even where it doesn't overlap with MIR1282 the reads are ambiguously mapping to multiple other locations? Like a pseudogene? Reads being counted in these categories might also explain why you have a discrepancy for your second gene example? Maybe you have duplicate reads being shown in IGV which are not being counted by htseq-count for that gene but instead being put in the alignment_not_unique bin? I'm just guessing here. Do values seem sensible for most other genes? Or, do they all seem problematic?

Maybe you are having a similar problem as was Difference Between Htseq-Count And Bedtools Multicov. Looking at your GFF file it seems that your gene_id = transcript_id. Therefore you are effectively doing only transcript-level analysis. With the default htseq-count mode (union) any read that overlaps more than one transcript will be dropped into the ambiguous bin. For genes with multiple transcripts that have almost no unique exon content, there maybe no reads that don't overlap more than one transcript and therefore counts drop to 0. Try getting a GFF file that actually distinguishes between transcripts and genes. I suggest Ensembl GTF. And maybe try some of the other htseq-count modes as explained in the software documentation such as "intersection-strict".

ADD COMMENT
0
Entering edit mode

My gtf is coming from UCSC, so the first problem (overlapping genes) is coming from there. But the second one (the gene alone), I checked the reads and the are all good (alignment score : 255, NH:1 ).. I check the htseq outut and it gives me a lot of alignmentnotunique..

ADD REPLY
0
Entering edit mode

Again, I think what is happening is that your GTF does not distinguish between transcripts and genes. So, htseq-count treats all transcripts as if they were different genes. Thus it throws away not just reads that are non-unique because of overlapping genes but also if they are non-unique because of overlapping transcript isoforms for the same gene. I don't see any way around that other than using a different GTF file.

ADD REPLY
0
Entering edit mode

I redid it with ensembl gtf (from ensembl, not UCSC) and I have the same behaviour..

ADD REPLY
0
Entering edit mode

But when I put -s no (so htseq-count will not take the strand into account). The gene "alone" (cf figure 2), it gives me good read count (~ 600). With -s yes : read count = ~9

ADD REPLY
0
Entering edit mode

That's weird. Sounds like this really is a problem with how htseq-count is interpreting stranded info from STAR.

ADD REPLY
0
Entering edit mode

So, you are still not getting any counts for SERF2? Could you put a snippet of your ensembl gtf and the command you ran and results you get for those genes above. Just to complete the story of what you have tried so far. I am just trying to figure out htseq-count myself. Maybe we should try to invite the author to comment on this thread?

ADD REPLY

Login before adding your answer.

Traffic: 2522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6