get gene biotype for hg38 refseq
4.9 years ago
idaliu613 ▴ 10

How do I query the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) to get a GTF file that includes gene biotype? The biotypes can come from Ensembl, but I want an annotation file with biotypes, if possible. (:

4.5 years ago
vkkodali_ncbi ★ 3.7k

Since you have tagged the post with 'refseq', I am assuming you are interested in RefSeq annotation. If that is the case, I suggest you download the relevant files directly from the NCBI FTP site. The GTF and GFF3 files for RefSeq annotation include gene_biotype information in column 9.
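For anyone who wants to pull gene_biotype out of column 9 once the file is downloaded, here is a minimal base R sketch. The two records below are made up for illustration (the sequence name, gene IDs and coordinates are not real), and `get_attr` is a hypothetical helper, not part of any package. It assumes the attributes are written as `key "value";` pairs, which is the usual GTF layout.

```r
# Two made-up GTF records (9 tab-separated columns); illustrative only.
gtf_lines <- c(
  paste("chrA", "BestRefSeq", "gene", "100", "900", ".", "+", ".",
        'gene_id "GENE1"; transcript_id ""; gene_biotype "protein_coding";',
        sep = "\t"),
  paste("chrA", "BestRefSeq", "gene", "1500", "1800", ".", "-", ".",
        'gene_id "GENE2"; transcript_id ""; gene_biotype "lncRNA";',
        sep = "\t")
)

# For a real file, replace text = gtf_lines with the path to the GTF.
# quote = "" keeps the double quotes inside column 9 intact.
gtf <- read.delim(text = gtf_lines, header = FALSE, quote = "",
                  comment.char = "#", stringsAsFactors = FALSE)

# Hypothetical helper: extract the value of one key from the attribute column.
# Returns NA where the key is absent.
get_attr <- function(attrs, key) {
  pattern <- paste0(key, ' "([^"]*)"')
  hits <- regmatches(attrs, regexec(pattern, attrs))
  vapply(hits,
         function(x) if (length(x) >= 2) x[2] else NA_character_,
         character(1))
}

# Keep only gene records and build a gene_id -> gene_biotype table.
genes <- gtf[gtf$V3 == "gene", ]
biotypes <- data.frame(
  gene_id      = get_attr(genes$V9, "gene_id"),
  gene_biotype = get_attr(genes$V9, "gene_biotype"),
  stringsAsFactors = FALSE
)
print(biotypes)
```

If the gene_biotype column comes back as all NA, the file probably does not carry that attribute, which would suggest it came from a different source than the RefSeq GTF.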


Hi vkkodali,

I have the same question. I followed your answer and downloaded the hg19 GTF file, but column 9 contains only the gene ID, not gene_biotype (screen capture link: https://drive.google.com/file/d/1OkpcDF_u2-yzAKlg8s46vIVOi8pc4AZ1/view?usp=sharing )


Please download the GTF from GENCODE: https://www.gencodegenes.org/


Please download the data from the NCBI RefSeq FTP site, not UCSC. For hg19, you can search for GRCh37 in the NCBI Assembly portal to get to this page. Once you are there, click on the 'Download Assembly' button, choose 'RefSeq' as the source database and GTF as your file type. You will end up downloading a tarball containing the GTF file. Alternatively, you can go to the FTP path directly by clicking on the 'FTP directory for RefSeq assembly' link in the right-hand bar and choosing the file of interest to you.


^^ this can work, too.

4.0 years ago

Just adding for other users who land on this page.

Another solution is to simply generate a 'master' table in biomaRt:

# library() stops with an error if the package is missing; require() only warns
library(biomaRt)

mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)

Check that it is indeed GRCh38:

searchDatasets(mart = mart, pattern = 'hsapiens')
                 dataset              description    version
78 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Now generate the table:

annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'hgnc_symbol',
    'ensembl_gene_id',
    'refseq_mrna',
    'refseq_ncrna',
    'gene_biotype'),
  uniqueRows = TRUE)


head(annotLookup)
  hgnc_symbol ensembl_gene_id refseq_mrna refseq_ncrna   gene_biotype
1       MT-TF ENSG00000210049                                 Mt_tRNA
2     MT-RNR1 ENSG00000211459                NR_137294        Mt_rRNA
3       MT-TV ENSG00000210077                                 Mt_tRNA
4     MT-RNR2 ENSG00000210082                NR_137295        Mt_rRNA
5      MT-TL1 ENSG00000209082                                 Mt_tRNA
6      MT-ND1 ENSG00000198888                          protein_coding
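As a possible follow-up, the table can be used to look up the biotype for a list of RefSeq accessions. The sketch below is base R only and rebuilds the six rows from the head(annotLookup) output above so that it runs without a biomaRt query; with the real table you would skip that step. The names `refseq_long` and `query` are made up here, and `NM_000000` is a placeholder ID chosen to show what an unmatched accession returns.

```r
# The six rows shown in head(annotLookup), rebuilt by hand for illustration.
annotLookup <- data.frame(
  hgnc_symbol     = c("MT-TF", "MT-RNR1", "MT-TV", "MT-RNR2", "MT-TL1", "MT-ND1"),
  ensembl_gene_id = c("ENSG00000210049", "ENSG00000211459", "ENSG00000210077",
                      "ENSG00000210082", "ENSG00000209082", "ENSG00000198888"),
  refseq_mrna     = c("", "", "", "", "", ""),
  refseq_ncrna    = c("", "NR_137294", "", "NR_137295", "", ""),
  gene_biotype    = c("Mt_tRNA", "Mt_rRNA", "Mt_tRNA", "Mt_rRNA", "Mt_tRNA",
                      "protein_coding"),
  stringsAsFactors = FALSE
)

# Stack the mRNA and ncRNA accessions into a single refseq_id column.
keep <- c("hgnc_symbol", "gene_biotype")
refseq_long <- rbind(
  data.frame(refseq_id = annotLookup$refseq_mrna, annotLookup[keep],
             stringsAsFactors = FALSE),
  data.frame(refseq_id = annotLookup$refseq_ncrna, annotLookup[keep],
             stringsAsFactors = FALSE)
)

# Drop rows with no RefSeq accession.
refseq_long <- refseq_long[refseq_long$refseq_id != "", ]

# Look up biotypes; accessions missing from the table come back as NA.
query <- c("NR_137294", "NR_137295", "NM_000000")  # last one is a placeholder
result <- refseq_long$gene_biotype[match(query, refseq_long$refseq_id)]
names(result) <- query
print(result)
```

Note that match() returns only the first hit per accession, so if one RefSeq ID maps to several Ensembl genes in the full table you may prefer merge() to see every row.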

Kevin
