Question: Where can I get a breakdown of RefSeq annotation statistics?
21 months ago
b10hazard • 20
United States

GENCODE has a wonderful breakdown of the number of coding and non-coding genes and transcripts here:

http://www.gencodegenes.org/stats.html

Does a similar breakdown exist for RefSeq anywhere? Thanks!

21 months ago
genomax 68k
United States

If you take a look at this hg19/GRCh37 GFF3 file from NCBI, here is the count of each feature type (column 3):

495245 CDS
     1 D_loop
 33314 Genomic
     1 RNase_MRP_RNA
     1 RNase_P_RNA
     2 SRP_RNA
     4 Y_RNA
    22 antisense_RNA
 19086 cDNA_match
618452 exon
 30661 gene
  6405 lnc_RNA
 49861 mRNA
 22621 match
  3097 miRNA
    31 ncRNA
  2046 primary_transcript
    23 rRNA
   297 region
     1 sequence_feature
    72 snRNA
   393 snoRNA
   698 tRNA
     1 telomerase_RNA
  5732 transcript
     3 vault_RNA

For GRCh38, a similar file can be found here, with the following statistics:

1413848 CDS
      1 D_loop
  33893 Genomic
     26 Genomic%2CXM/XP/XR
      1 RNase_MRP_RNA
      1 RNase_P_RNA
      2 SRP_RNA
     15 V_gene_segment
      4 Y_RNA
     22 antisense_RNA
  13572 cDNA_match
     24 centromere
1856502 exon
  43504 gene
  28117 lnc_RNA
 114575 mRNA
  22834 match
   3038 miRNA
     31 ncRNA
   2025 primary_transcript
     23 rRNA
    558 region
    304 sequence_feature
     62 snRNA
    389 snoRNA
    629 tRNA
      1 telomerase_RNA
  16011 transcript
      3 vault_RNA

Good answer. I was not aware that RefSeq had been compiling this info.

Reply posted 21 months ago by Kevin Blighe (43k)

This is not RefSeq but the files are from NCBI's human genome resource. Should be close enough.

Reply posted 21 months ago by genomax (68k)

How did you parse that information from the GFF3 file?

Reply posted 21 months ago by b10hazard (20)

awk -F'\t' '!/^#/ {print $3}' interim_GRCh38.p10_top_level_2017-01-13.gff3 | sort | uniq -c
Reply posted 21 months ago by genomax (68k)
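
For anyone reusing the one-liner above, here is a minimal sketch of the same tally wrapped in a shell function. It skips "#" header and directive lines, splits on tabs, ignores anything that is not a 9-column GFF3 record (for example a trailing FASTA section), and sorts by count. The function name count_feature_types is hypothetical, chosen only for illustration.

```shell
# Tally GFF3 feature types (column 3).
# count_feature_types is a hypothetical helper name, not part of any tool.
count_feature_types() {
  awk -F'\t' '
    /^#/ { next }          # skip ##gff-version, ##sequence-region, etc.
    NF >= 9 { n[$3]++ }    # count only well-formed 9-column records
    END { for (t in n) printf "%d\t%s\n", n[t], t }
  ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage (file name taken from the reply above):
#   count_feature_types interim_GRCh38.p10_top_level_2017-01-13.gff3
```

The output is one "count<TAB>type" line per feature type, largest count first, which is easy to feed into further filtering.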

That works! Thanks!

Reply posted 21 months ago by b10hazard (20)

21 months ago
Kevin Blighe 43k
Republic of Ireland

I don't believe RefSeq provides as good a breakdown of transcript types as GENCODE; however, you may be interested in the following resources:

Generally, I think you'll find that GENCODE is more comprehensive for non-coding genes; however, for the majority of these, the exact function is entirely unknown. Most people filter them out of, for example, RNA-seq experiments, partly so that fewer tests are performed and the false discovery rate correction is less stringent. On the other hand, RefSeq has the feel of a well-curated resource.

Kevin


What I was really hoping to get was the number of full-length non-coding transcripts that RefSeq has for hg19. Is there any way to get this information?

Reply posted 21 months ago by b10hazard (20)

The first link above is to a published manuscript in which GENCODE transcripts were compared to those of RefSeq, using GRCh37/hg19 transcripts. Additional File 3 contains a table comparing GENCODE to RefSeq NR, which, according to RefSeq, are non-protein-coding transcripts (or transcripts unlikely to have protein-coding potential).

Additional files can be found here: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2#MOESM6

Reply posted 21 months ago by Kevin Blighe (43k)
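
As a rough way to get a comparable number straight from the NCBI GFF3 file discussed in this thread, the sketch below counts distinct transcript accessions by RefSeq prefix: NR_ (curated non-coding) and XR_ (predicted non-coding). It assumes that the attributes column (column 9) carries a transcript_id= key, which should be checked against the file in hand. The function name count_noncoding_transcripts is hypothetical. It says nothing about whether a transcript is full length.

```shell
# Count distinct non-coding RefSeq transcript accessions in a GFF3 file.
# Assumption: column 9 contains "transcript_id=<accession>"; verify this
# for your file. count_noncoding_transcripts is a hypothetical helper name.
count_noncoding_transcripts() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }    # skip headers and non-record lines
    match($9, /(^|;)transcript_id=[NX]R_[^;]+/) {
      id = substr($9, RSTART, RLENGTH)
      sub(/^;?transcript_id=/, "", id)   # keep only the accession
      seen[id] = 1             # exons repeat the ID, so deduplicate
    }
    END {
      for (id in seen) n[substr(id, 1, 2)]++
      printf "NR\t%d\nXR\t%d\n", n["NR"], n["XR"]
    }
  ' "$1"
}

# Usage: count_noncoding_transcripts your_annotation.gff3
```

Deduplicating on the accession matters because child features such as exons may repeat the parent transcript's ID, which would otherwise inflate the count.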

Thanks for your help with this, Kevin. Those papers were excellent resources and they covered a lot of ground. I accepted GenoMax's answer mostly because that output file provided a very flexible way for me to extract the metrics I was looking for.

Reply posted 21 months ago by b10hazard (20)

You can "accept" more than one answer, so feel free to accept @Kevin's too.

Reply posted 21 months ago by genomax (68k)

No problem, b10hazard. That's the nature of the game here; it's not a competition to see who can have the most accepted answers. I was actually about to say that you should accept GenoMax's answer because it was a better fit for your question. GenoMax is also much more experienced than I.

Thanks for the diplomacy, GenoMax :)

Reply posted 21 months ago by Kevin Blighe (43k)
