Update

Question

Which annotation file to use

0

Entering edit mode

6.0 years ago

DVA ▴ 630

What annotation (GTF) file should I use for strand-specific 3' mRNA sequencing? I am trying to use FeatureCounts with 3' mRNA sequencing data. I think I need an annotation file (for Hg38) that only covers mRNA and is strand specific. Can someone recommend where to get such a file?

My code:

featureCounts -T 14 -a $hg38_uscs.gtf -o $sample.featureCount.txt sample.sort.bam

I have tried to use UCSC annotation table GENCODE24, but most of my reads cannot be counted via Feature Counts, and got called as "ambiguity", because (to my knowledge) they were aligned to regions that has more than one features. Is this a common issue...? My output summary.txt

Assigned    2030063
Unassigned_Unmapped 7166618
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 1302741
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_NoFeatures   583039
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    7609437

Update

I tried the annotation file downloaded with featureCounts. It is in SAF format, and I tried

featureCounts -T 14 -F 'SAF' -a $hg38_uscs.gtf -o $sample.featureCount.txt sample.sort.bam

And the aligned reads significantly improved! I am looking at the differences of the two files, but still, if anyone has any suggestions, I would love to know. Thank you.

RNA-Seq • 4.4k views

ADD COMMENT • link updated 6.0 years ago by h.mon 35k • written 6.0 years ago by DVA ▴ 630

0

Entering edit mode

Can you post the exact command you've used with featureCounts as well as the *summary file that it generates?

ADD REPLY • link 6.0 years ago by Friederike 8.9k

0

Entering edit mode

Thank you for the reply. I have edited the post with an update.

ADD REPLY • link 6.0 years ago by DVA ▴ 630

score 3 · Accepted Answer · 2018-05-01

3

Entering edit mode

6.0 years ago

h.mon 35k

I think I need an annotation file (for Hg38) that only covers mRNA and is strand specific.

I believe all transcribed features annotated on the human genome are "stranded", that is, the annotation contains information if the feature maps to the forward (+) or reverse (-) DNA reference strand - so the annotation you are using is already "stranded".

The problem is with your featureCounts settings: if your RNAseq is strand-specific, you have to use featureCounts -s something:

-s <int>            Perform strand-specific read counting. Acceptable values:
                    0 (unstranded), 1 (stranded) and 2 (reversely stranded).
                    0 by default.

Note the genome annotation strand (feature maps to the forward or reverse DNA reference strand) is not the same thing as the featureCounts strandedness setting. featureCounts strand is related to the strand of cDNA construction being sequenced. As your sequencing is single end, stranded is when the reads are in the same strand of the original RNA molecule being transcribed, and reverse is when the reads are on the opposite strand to the original RNA molecule being transcribed.

If you use the default setting of unstranded for stranded RNAseq, the better the annotation, the more ambiguity featureCounts will find.

ADD COMMENT • link 6.0 years ago by h.mon 35k

0

Entering edit mode

Thank you for the answer! I was looking at -s just now. What exactly is "stranded" and "reversely stranded"...? Is it "+" and "-" respectively?

And seems like the ucsc version is just so thorough and provides lots overlaps that causes ambiguity counts. Is my understanding right please?

ADD REPLY • link 6.0 years ago by DVA ▴ 630

1

Entering edit mode

I updated my answer to explain what strand means for featureCounts.

And seems like the ucsc version is just so thorough and provides lots overlaps that causes ambiguity counts.

Yes, I believe this is the case, but only so because you unknowingly used the default unstranded setting, instead of choosing the correct setting.

ADD REPLY • link 6.0 years ago by h.mon 35k

0

Entering edit mode

I tried to use -s 1. It helps a little bit when I use the uscs GENECODEv24 or use the txt provided by feature count, like 1%. What is your experience in doing this please? Seems like the most important thing is to use a not so thorough annotation file... Does this sound right to you? Thanks so much for the help.

ADD REPLY • link 6.0 years ago by DVA ▴ 630

0

Entering edit mode

No, it doesn't, one should use the best annotation possible. Assuming your stranded library preparation is good and you have used the correct strand setting in featureCounts, your Unassigned_Ambiguity is abnormally high. So you should check your sequencing quality for issues.

I have seen it has been suggested you should use featureCounts -s 2 ( High Unassigned Ambiguity in Feature Counts ). Why did you post a very similar question, without following the advice given previously?

edit: your Unassigned_Unmapped is also quite high for human RNAseq.

ADD REPLY • link 6.0 years ago by h.mon 35k

0

Entering edit mode

Thank you very much for looking into the details. I opened this new post when I was not sure why annotation file make a huge difference, and thought it would be a different/quick question. I apologize and I will also make a note in the previous post. I did tried both -s 2 and -s 1, and -s 2 gives very low assigned reads. -s 1 on the other hand works well with feature counts' txt annotation file, and I confirmed with the library prep procedure that our reads are with the same orientation as mRNA. These all make sense now.

However, it still does not explain why a uscs annotation file is not working with -s 1. Is that possible that many of reads just happen to align to the regions, where there are genes overlapping each other?

The high unassigned unmapped reads is expected, because our library prep had some issue and many reads are not alignable to reference. But thanks a lot for looking into this.

ADD REPLY • link 6.0 years ago by DVA ▴ 630

0

Entering edit mode

to better understand the issue, it might be worthwhile to identify one or two genes that show particularly large discrepancies in the read counts after either SAF or GTF usage.

Then wrangle the SAF format file into a BED file (column order: chromosome, start, end, gene name, arbitrary score, strand) and load the BAM file, the BED file and the GTF file into a Genome Browser, e.g. IGV and have a look at those example genes. Feel free to share the snapshot here, I'd be interested to see it.

ADD REPLY • link 6.0 years ago by Friederike 8.9k

0

Entering edit mode

Thank you! This is very helpful! I'm trying to figure out a way to actually visualize what is going on. I will post the update soon.

ADD REPLY • link 6.0 years ago by DVA ▴ 630