Question

GAPDH missing from htseq-counts file

0

Entering edit mode

6.5 years ago

melia_g • 0

I have some RNAseq data from rat neurons which was processed using STAR and HTseq. I noticed earlier today that for all samples, there were fewer than 20 copies of GAPDH, which should be highly expressed, in the HTSeq count files. There may be other high count genes missing, but I haven't been able to find any other anomalies so far and there are genes such as Map2, Calm1/2 and Nrgn with counts in the tens of thousands range.

I've been trying to trace back the source of this error, starting with one of the samples as a test case, and found in the STAR BAM file (converted from SAM) that there were ~15000 alignments to the GAPDH co-ordinates, trying two slightly different sets of co-ordinates for GAPDH in the rat genome from Ensembl and NCBI.

samtools view s10.bam 4:157676336-157680322 | wc -l

Yet for the same sample, the count of uniquely aligned reads in HTSeq is just 9.

python -m HTSeq.scripts.count -s no -a 10 ~/Star_cleanreads/S10.sam Rattus_norvegicus.Rnor_6.0.89.gtf > s10.count grep ENSRNOG00000018630 s10.count

I'm assuming that the issue is the GAPDH transcripts are for some reason aligning to multiple locations and have tried to confirm this by running HTSeq with the "--nonunique all" option, but get the error that this option doesn't exist. I can't find the version number for HTSeq but assume I must be running an old version. I've also had a look at GAPDH in the gtf file and can't see an obvious problem there.

Has anyone else run into this issue and/or can advise how to proceed?

Many thanks

htseq gapdh nonunique • 1.9k views

ADD COMMENT • link updated 3.6 years ago by Biostar 20 • written 6.5 years ago by melia_g • 0

Kevin Blighe · Answer 1 · 2017-10-24

0

Entering edit mode

6.5 years ago

Kevin Blighe 87k

Are you sure that GAPDH is expressed in high quantities in the tissue that you're researching?

Far from what people believe, the traditional housekeeping genes (including GAPDH) don't in fact show ubiquitous and equal expression patterns across tissues - far from it.

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi, thanks for taking the time to respond. I did have a quick look at the literature first and yes GAPDH does seem to be expressed in CA1 neurons. Regardless, there are many alignments to the GAPDH co-ordinates in the SAM file which is what makes me think there's an issue with the counting.

In the history I found a post about a similar issue with Cufflinks, apparently due to the max number of fragments per locus parameter and wondered if it could be something similar, but I've had no luck so far ( https://biostar.usegalaxy.org/p/12372/ ).

ADD REPLY • link updated 6.5 years ago by Kevin Blighe 87k • written 6.5 years ago by melia_g • 0

0

Entering edit mode

Thanks for getting back. That could certainly be the cause, and it would be exacerbated by the fact that GAPDH is a relatively small gene. It also sits in very close proximity to other genes, which may prove probematic for HTSeq, I read:

We find that if one uses HTSeq counts as input, the maximum achievable sensitivity is lower than for the other methods. This is due to the fact that for overlapping gene symbols HTSeq does not count the reads and for those genes differential expression cannot be positive.

source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879183/

I can neither confirm that that is the issue though.

Might I recommend an alignment-free RNA-seq quantification program like Kallisto? The reference is the genomic transcriptomic sequences for each transcript (FASTA), over which abundance is calculated directly from the FASTQ/A reads. You may see improvement in the problem that way.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you so much! I'm getting a compilation error with Kallisto, but quantifying with Salmon recovers GAPDH. For the sample above: ENSRNOT00000050443.4, TPM: 2302.93; Counts: 10686.1