ENSEMBL annotation file for quantification: which file to use?
5.6 years ago
psm ▴ 130

Hello, question regarding gene quantification for RNAseq: I've used HISAT2 to align my reads against the hg38 genome, and used UCSC annotation for this.

I now want to perform gene-level quantification using featureCounts. On the Ensembl website (ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens), there are many options for GTFs:

Homo_sapiens.GRCh38.93.chr.gtf.gz
Homo_sapiens.GRCh38.93.chr_patch_hapl_scaff.gtf.gz
Homo_sapiens.GRCh38.93.gtf.gz
Homo_sapiens.GRCh38.93.abinitio.gtf

What is the difference between these and which should I choose?

Also, the original alignment was done using the UCSC GTF. Would it be acceptable to then count using the Ensembl annotation? I want to switch because of this paper.

Many thanks in advance for any help.

RNA-Seq ensembl annotation • 4.8k views

It is important to make sure the chromosome identifiers are the same in your FASTA reference and your GTF annotation. If one uses chr1 and the other just 1, then you have a problem.
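A quick way to check this before counting is to compare the sequence names in the two files. The following is a minimal sketch, not part of the original comment: the function names are made up for illustration, and it assumes standard FASTA headers (`>name description`) and tab-separated GTF lines with the sequence name in the first column.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fasta_seqnames(lines):
    """Collect sequence names from FASTA header lines (text up to the first whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column of every non-empty, non-comment GTF line."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(fasta_names, gtf_names):
    """Report which names are shared and which occur in only one of the files."""
    return {
        "shared": fasta_names & gtf_names,
        "fasta_only": fasta_names - gtf_names,
        "gtf_only": gtf_names - fasta_names,
    }
```

To run it on real files, pass the open handles, for example `fasta_seqnames(open_text("genome.fa"))` and `gtf_seqnames(open_text("annotation.gtf.gz"))` (both file names are placeholders). An empty `shared` set is the sign that one file uses `chr1`-style names and the other `1`-style names.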


Thank you for that pointer - noted. Thankfully, if I understand correctly, Devon Ryan has pointed out that featureCounts is not impaired by this for UCSC and Ensembl chromosome names.

5.6 years ago
Ben_Ensembl ★ 2.4k

Just to complement Devon Ryan's answer:

.gtf: This is the default file, and all species have one. It should contain the full annotation for all species except human and mouse. For human and mouse, it contains all annotation on the primary assembly, i.e. excluding patch and haplotype regions.

.chr.gtf: Contains only annotation on chromosomes, so toplevel scaffolds are excluded (patch and haplotypes are not included).

.chr_patch_hapl_scaff: Contains all annotation on all toplevel sequences, including patch and haplotype regions. It should only exist for human and mouse.

Species with no chromosomes will have a single file, .gtf
Species with only chromosomes but no scaffolds will have a single file, .gtf
Species with chromosomes and scaffolds will have two files, .gtf and .chr.gtf
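To see which kinds of sequence a downloaded GTF actually covers, you can tally its records by sequence type. This is only a rough sketch with made-up function names. It assumes the human Ensembl naming convention in which primary chromosomes are named 1-22, X, Y and MT and patch/haplotype regions have names starting with `CHR_`; anything else is treated as an unplaced scaffold. Check those assumptions against your own file before relying on the counts.

```python
from collections import Counter

# Assumption: Ensembl-style names for the human primary chromosomes.
HUMAN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def classify_seqname(name):
    """Put a sequence name into one of three rough categories."""
    if name in HUMAN_CHROMOSOMES:
        return "chromosome"
    if name.startswith("CHR_"):  # assumed prefix for patch and haplotype regions
        return "patch_or_haplotype"
    return "scaffold"


def summarize_gtf(lines):
    """Count GTF records per category, skipping comment and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        seqname = line.split("\t", 1)[0]
        counts[classify_seqname(seqname)] += 1
    return dict(counts)
```

Going by the descriptions above, the .chr.gtf file should yield only the `chromosome` category, the default .gtf file should add `scaffold`, and the .chr_patch_hapl_scaff file should show all three.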

Further information can be found in the README file: http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/README


Would it be possible to add that to the README file? The ab initio file is mentioned, but the other two aren't.


Sure, I'll talk with my colleagues who are responsible for the README file and see whether we can update it to make it more comprehensive.


Very helpful! Thanks for clarifying the different formats.

5.6 years ago

You're lucky that featureCounts can translate between UCSC and Ensembl chromosome names; most tools can't. So you should use Homo_sapiens.GRCh38.93.gtf.gz (using the chr_patch_hapl_scaff file won't hurt, it just contains contigs absent from your reference genome).
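For tools that cannot translate the names, one workaround is to rewrite the GTF's first column yourself. The sketch below is an illustration with hypothetical function names, not something featureCounts needs. It only handles the primary chromosomes (1-22, X, Y become chr1-chr22, chrX, chrY, and MT becomes chrM) and drops records on any other sequence, because scaffold names differ between Ensembl and UCSC in ways a simple prefix cannot fix.

```python
# Assumption: Ensembl-style names for the human primary chromosomes.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def ensembl_to_ucsc(name):
    """Map an Ensembl primary chromosome name to its UCSC name, or None if unknown."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    return None  # scaffolds, patches and haplotypes are not handled here


def rename_gtf_lines(lines):
    """Yield GTF lines with UCSC-style names, dropping records that cannot be mapped."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line  # pass comment and blank lines through unchanged
            continue
        seqname, sep, rest = line.partition("\t")
        new_name = ensembl_to_ucsc(seqname)
        if new_name is not None:
            yield new_name + sep + rest
```

Dropping the unmapped records is a deliberate simplification; if you need genes on scaffolds, a full name-mapping table between the two naming schemes would be required instead.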


Thanks for breaking it down for me - that's exactly what I wanted to know.
