GTF file for reference
6.5 years ago
KVC_bioinfo ▴ 590

Hello,

I downloaded the RefSeq transcripts from this link to use as the reference for aligning human RNA-Seq reads to the transcriptome. I was not sure how to get a GTF file for this reference. I posted that question on Biostars a few days ago and was told to download it from the UCSC Table Browser, so I downloaded it from there.

However, the GTF from the Table Browser has the same value for gene_id and transcript_id, which is not suitable for analysis with HTSeq. So I have a couple of questions:

  1. What should I do in this case? I do not feel safe editing the GTF file.
  2. Is there another way to get a GTF for this specific reference that is compatible with HTSeq?
RNA-Seq gtf refseq • 4.0k views
6.5 years ago

I would highly recommend the GENCODE GTF, whose information fields contain the gene symbols that you want. I am almost certain that it is compatible with HTSeq.

See here: http://www.gencodegenes.org/releases/current.html

[be sure to download the correct GTF for your genome build (GRCh37/hg19 or GRCh38/hg38)]


Thank you. I am using GRCh38, and I followed the link you provided. Can I use "Comprehensive gene annotation", the very first file on that page, when the reference is the human transcriptome (NCBI's RefSeq transcripts)?


Yes, precisely.

Here is the direct link: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

Here is the first record (DDX11L1 is 'always' the first gene, right at the beginning of the short arm of chr1):

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
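To see where the gene symbol lives in such a record, here is a minimal Python sketch that splits the ninth (attribute) column into key/value pairs. The helper name `parse_gtf_attributes` is made up for illustration, and the record is rebuilt with tab separators on the assumption that the real file is tab-delimited, as GTF files normally are.

```python
import re

def parse_gtf_attributes(attr_field):
    """Parse the 9th GTF column into a dict (hypothetical helper).

    Handles quoted values (gene_id "X";) and bare ones (level 2;).
    If a key is repeated, only its first value is kept.
    """
    attrs = {}
    pattern = r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;'
    for key, quoted, bare in re.findall(pattern, attr_field):
        attrs.setdefault(key, quoted if quoted else bare)
    return attrs

# The first GENCODE v27 record quoted above, rebuilt with tab separators
# (assumption: the real file is tab-delimited).
record = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; '
    'gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2; '
    'havana_gene "OTTHUMG00000000961.2";',
])

fields = record.split("\t")
attrs = parse_gtf_attributes(fields[8])
print(attrs["gene_id"], attrs["gene_name"])
```

In this record gene_id holds the Ensembl gene identifier and gene_name holds the symbol, so both are available to downstream tools.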


Thank you very much. I was under the wrong impression that the GTF file for Human genome and Human transcriptome is different.


If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.


Thanks for the information!

6.5 years ago
genecats.ucsc ▴ 580

If you would like to "edit" a GTF file obtained from the UCSC Table Browser, we provide some utilities for doing so: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

The basic gist is to download your table of interest, chop off some columns (may or may not be necessary depending on the specific table), then run the genePredToGtf utility:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg19 | \
cut -f2- | genePredToGtf -source=hg19.refGene.ucsc file stdin stdout

Change stdout to the output filename you want in the last command to get an hg19 refGene GTF file:

chr1    hg19.refGene.ucsc   transcript  11869   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357";  gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    11869   12227   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "1"; exon_id "NR_148357.1"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    12613   12721   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "2"; exon_id "NR_148357.2"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    13221   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "3"; exon_id "NR_148357.3"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
...
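If you want to confirm that a GTF no longer has the problem described in the question, a rough Python sketch like the one below counts the records whose gene_id equals their transcript_id. The function name `count_identical_ids` is hypothetical, the first two sample lines are taken from the output above, and the third is an invented example of what a colliding record might look like, not actual Table Browser output.

```python
import re

# Matches gene_id "..." and transcript_id "..." in the attribute column.
ID_PATTERN = re.compile(r'(gene_id|transcript_id) "([^"]*)"')

def count_identical_ids(gtf_lines):
    """Return (records_checked, records_where_gene_id_equals_transcript_id).

    Hypothetical helper: skips comments, blank lines, and lines
    without nine tab-separated columns.
    """
    n_checked = 0
    n_same = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        ids = dict(ID_PATTERN.findall(fields[8]))
        if "gene_id" in ids and "transcript_id" in ids:
            n_checked += 1
            if ids["gene_id"] == ids["transcript_id"]:
                n_same += 1
    return n_checked, n_same

prefix = "chr1\thg19.refGene.ucsc\ttranscript\t"
sample = [
    # Two records from the genePredToGtf output shown above.
    prefix + '11869\t14362\t.\t+\t.\tgene_id "LOC102725121"; '
             'transcript_id "NR_148357";  gene_name "LOC102725121";',
    prefix + '11874\t14409\t.\t+\t.\tgene_id "DDX11L1"; '
             'transcript_id "NR_046018";  gene_name "DDX11L1";',
    # Invented record illustrating the collision (assumption, for the demo).
    prefix + '11874\t14409\t.\t+\t.\tgene_id "NR_046018"; '
             'transcript_id "NR_046018";',
]

result = count_identical_ids(sample)
print("checked %d records, %d with identical IDs" % result)
```

To check a real file, pass an open file handle instead of the sample list, for example `count_identical_ids(open("refGene.gtf"))`, where the filename is whatever you chose in place of stdout.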

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

  • General questions: genome@soe.ucsc.edu
  • Questions involving private data: genome-www@soe.ucsc.edu
  • Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser
