Biostar Beta. Not for public use.
featureCount vs GENCODE gene lengths
0
Entering edit mode
14 months ago
user31888 • 60
United States

I used featureCount for counting reads in a bam file.

featureCount notably outputs the length of each gene.

However, I noticed huge differences in number of gene exons and length between the 'hg38_RefSeq_exon.txt" annotation file provided with featureCount and the last version of GENCODE gtf. Why is that?

Maybe the following approach is not correct:

(1) keeping exon features only in GENCODE v28 GRCh38.p12 gtf

(2) sort the gtf subset by chromosome and start position:

chr1    11869   12227   ENSG00000223972
chr1    12010   12057   ENSG00000223972
chr1    12179   12227   ENSG00000223972
chr1    12613   12697   ENSG00000223972
chr1    12613   12721   ENSG00000223972
chr1    12975   13052   ENSG00000223972
...

(3) get the union of overlapping exons from the same genes:

> bedtools merge -i <sorted_gtf_subset> -d 0 -c 4 -o distinct
chr1    11869   12227   ENSG00000223972
chr1    12613   12721   ENSG00000223972
chr1    12975   13052   ENSG00000223972
chr1    13221   14501   ENSG00000223972,ENSG00000227232
chr1    15005   15038   ENSG00000227232
chr1    15796   15947   ENSG00000227232
chr1    16607   16765   ENSG00000227232
chr1    16858   17055   ENSG00000227232
...

(4) for each gene, sum the difference between start / end position of merged exons.

For example, featureCount outputs a length of 20,654 for ENSG00000237298 (RefSeq ID: 100506866) with 22 exons in its annotation file 'hg38_RefSeq_exon.txt'.

Using the method above with GENCODE gtf, I get 63 exons for this same gene and a length of 62,661 (difference of 42,007 bases !!!). I also noticed that 50 of these exons are also shared with other genes (50 exons with ENSG00000155657 and 1 with ENSG00000270956)

Could someone explain this difference to me?

ADD COMMENTlink
0
Entering edit mode

Have you examined the GENCODE GTF file carefully? It is possible that there are overlapping annotations in that regions that is leading to the much higher count using your method.

As you point out the RefSeq entry is clear with 22 exons marked in it and the longest transcript included in ENSG00000237298 also points to that same entry.

ADD REPLYlink
0
Entering edit mode

I have just used the same approach with the latest RefSeq gff3. It gives the exact same 22 exons as in featureCount. I don't see what other filter I could apply to the GENCODE gtf (the source maybe?)

ADD REPLYlink
3
Entering edit mode
13 months ago
EMBL-EBI
  1. It's an antisense gene to Titin. It's a whopper. As an antisense gene, it, of course, shares exonic regions (though on the opposite strand) with Titin (ENSG00000155657), the gene it is antisense to because that's what an antisense gene is. It doesn't share exons with ENSG00000270956 from what I can see.

  2. RefSeq and Ensembl annotation is fundamentally different. The manual gene annotation done in Ensembl aims to be as comprehensive as possible, annotating all transcripts for which there is evidence. This particularly affects non-coding transcripts, for which there is evidence that transcription is occurring pretty much everywhere. RefSeq annotation is more conservative, so they miss lots of non-coding transcripts.

ADD COMMENTlink
0
Entering edit mode

Thanks Emily ! I understand now.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1