Question

Single-cell RNA-seq coverage track not corresponding to htseq-count output or IGV gene annotations?

2

Entering edit mode

8.2 years ago

shanasabri ▴ 40

I have some single-cell RNA-seq data that I've processed via a tophat2 --> htseq-count pipeline, and I've also generated some bed graphs via bedtools genomecov. I have stumbled upon two issues that I don't seem to understand.

Issue 1

I've noticed that the htseq-count and bedgraphs for this data is not necessarily in agreement. For example, we see peaks on IGV with cells that have 0 counts according to htseq. An example of this can be seen in cells H11 and G11. According to htseq-count, H11 has 0 hits on Xist while G11 has 320. If we look at the IGV tracks for these cells we see peaks on both tracks despite the 0 in H11 (please see attached screenshot below; tracks are auto-scaled). My reasoning for this is because htseq-count does not account for multi mapped reads and these may be the read showing up on IGV. Also the bed graphs are not normalized/scaled. Does anyone have other insight? In my pipeline I've treated all cells as non-strand-specific (tophat2 argument --library-type fr-unstranded and htseq-count argument --stranded=no), which I believe is correct given the library prep conditions. The pipeline is briefed below. I should mention that no errors or warning were thrown thought the pipeline. Please let me know if anything strikes you as incorrect.

Pipeline

IGV screenshot 1

Issue 2

So in my RNA-seq data sometimes I see reads that do not correspond to an annotated gene/region, so let's say these are "New genes".

Would I see such regions in other published data, or did they somehow only provide coordinates for regions that are annotated? Because, just looking in IGV, I see no such regions in other published data, so I'm wondering if it could be because of the way they deposited their data.

Another thought is that these 'unannotated' regions are actually annotated but using USCS or other gene tracks.

Please refer to the screen shot posted below.

IGV screenshot 2

RNA-Seq htseq-count rna-seq IGV coverage • 4.5k views

ADD COMMENT • link updated 8.2 years ago by Istvan Albert 100k • written 8.2 years ago by shanasabri ▴ 40

score 2 · Answer 1 · 2016-02-29

Counting reads that overlap a region can be more complicated problem than what it initially sounds like. It all comes down to what to do when there is a chance of double. triple etc counting a read. For example a multimapped read, or when regions overalap, as you yourself also noted.

Tools like genome coverage will add up the reads that overlap a feature - and they may (though I have to check) even count reads that bridge over that position (say with skipped bases). On the other hand tools like htseq and featurecounts are more sophisticated and have a series of rules of when to add these up. Read up on their documentation. In general though I will say that it is surprisingly difficult to manually replicate what each of these counting tools do.

As for your second question, it is very common to observe transcripts coming from seemingly unannotated regions. Many regions of the genome may be transcribed into RNA though most won't likely be protein coding (especially for already well annotated species)