Question: GTF file for reference
0
2.1 years ago
KVC_bioinfo • 390
Boston

Hello,

I have downloaded the reference for aligning RNA-Seq reads to the human transcriptome from this link. I downloaded the RefSeq transcripts from that link to use as a reference, but I was not sure how to get a GTF file for this reference. I posted that question on Biostars a few days ago and was told to download it from the UCSC Table Browser, so I downloaded it from that source.

However, the GTF from the Table Browser has the same value for gene_id and transcript_id, which is not suitable for analysis with HTSeq. So, I have a couple of questions:

  1. What should I do in this case? I do not feel safe editing the GTF file.
  2. Is there any other way to get a GTF for the specific reference I am using that will be compatible with HTSeq?
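
For anyone wanting to check a Table Browser GTF for this problem before passing it to HTSeq, here is a minimal sketch. It assumes a standard nine-column, tab-separated GTF on stdin; the function name `count_same_ids` is made up for illustration and is not part of HTSeq or any UCSC tool.

```shell
# Hypothetical helper: count GTF records whose gene_id equals their transcript_id.
# Reads a tab-separated GTF on stdin and prints "<same> <total>", where <total>
# is the number of records that carry both attributes.
count_same_ids() {
  awk -F'\t' '
    !/^#/ && NF >= 9 {
      g = ""; t = ""
      if (match($9, /gene_id "[^"]*"/))       g = substr($9, RSTART + 9, RLENGTH - 10)
      if (match($9, /transcript_id "[^"]*"/)) t = substr($9, RSTART + 15, RLENGTH - 16)
      if (g != "" && t != "") { total++; if (g == t) same++ }
    }
    END { printf "%d %d\n", same, total }
  '
}
```

It could be run as `count_same_ids < table_browser.gtf` (the filename is a placeholder). If the two numbers printed are equal, every record has identical gene_id and transcript_id, which is the situation described above.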
3
2.1 years ago
Kevin Blighe 43k
Republic of Ireland

I would highly recommend the GENCODE GTF, whose information fields contain the gene symbols that you want. I am almost certain that it is compatible with HTSeq.

See here: http://www.gencodegenes.org/releases/current.html

[be sure to download the correct GTF for your genome build (GRCh37/hg19 or GRCh38/hg38)]

0

Thank you. I am using GRCh38. I followed the link you provided. So can I use "Comprehensive gene annotation", the very first file on that page, when the reference used is the human transcriptome (NCBI's RefSeq transcripts)?

KVC_bioinfo • 390, 2.1 years ago
2

Yes, precisely.

Here is the direct link: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

Here is the first record (DDX11L1 is 'always' the first gene, right at the beginning of the short arm of chr1):

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
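
To pull a single field such as gene_name out of records like the one above, something along these lines may help. This is only a sketch: `gtf_attr` is a hypothetical helper name, and it assumes attributes are written as `key "value";` as in the record shown (unquoted attributes such as `level 2` are not handled).

```shell
# Hypothetical helper: print the value of one quoted attribute (e.g. gene_name)
# for every non-comment GTF line read from stdin.
gtf_attr() {
  awk -v key="$1" '
    !/^#/ {
      # Require a separator before the key so "gene" does not match "havana_gene".
      pat = "(^|[ \t;])" key " \"[^\"]*\""
      if (match($0, pat)) {
        hit = substr($0, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", hit)   # drop everything up to the opening quote
        sub(/"$/, "", hit)        # drop the closing quote
        print hit
      }
    }
  '
}
```

For example, `zcat gencode.v27.annotation.gtf.gz | gtf_attr gene_name | sort -u` would list the distinct gene symbols in the file linked above.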

Kevin Blighe 43k, 2.1 years ago
1

Thank you very much. I was under the wrong impression that the GTF files for the human genome and the human transcriptome are different.

KVC_bioinfo • 390, 2.1 years ago
1

If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.

WouterDeCoster 39k, 2.1 years ago
0

Thanks for the information!

KVC_bioinfo • 390, 2.1 years ago
3
2.1 years ago
genecats.ucsc • 560

If you would like to "edit" the GTF file you obtained from the UCSC Table Browser, we have provided some utilities to do so: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

The basic gist is to download your table of interest, chop off some columns (may or may not be necessary depending on the specific table), then run the genePredToGtf utility:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg19 | \
cut -f2- | genePredToGtf -source=hg19.refGene.ucsc file stdin stdout

To write an hg19 refGene GTF file, change stdout in the last command to the output filename you want. The output looks like this:

chr1    hg19.refGene.ucsc   transcript  11869   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357";  gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    11869   12227   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "1"; exon_id "NR_148357.1"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    12613   12721   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "2"; exon_id "NR_148357.2"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    13221   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "3"; exon_id "NR_148357.3"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
...
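
As a quick sanity check on a converted file, one could list each transcript with the gene it is assigned to and confirm that the two IDs now differ. A minimal sketch, assuming tab-separated genePredToGtf output with `transcript` records like those above; `tx2gene` is a hypothetical helper name, not a UCSC utility.

```shell
# Hypothetical helper: print "transcript_id<TAB>gene_id" for every
# "transcript" record in a tab-separated GTF read from stdin.
tx2gene() {
  awk -F'\t' '
    $3 == "transcript" {
      g = ""; t = ""
      if (match($9, /gene_id "[^"]*"/))       g = substr($9, RSTART + 9, RLENGTH - 10)
      if (match($9, /transcript_id "[^"]*"/)) t = substr($9, RSTART + 15, RLENGTH - 16)
      if (t != "") print t "\t" g
    }
  '
}
```

Run on the records shown above, this would pair NR_148357 with LOC102725121 and NR_046018 with DDX11L1, so gene-level counting has distinct gene identifiers to group by.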

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

  • General questions: genome@soe.ucsc.edu
  • Questions involving private data: genome-www@soe.ucsc.edu
  • Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser
