hg38 gtf file containing all RNAs and features
5.7 years ago
CrisMar ▴ 80

Hi All, I annotated a BED file with an hg38 GTF file (from GENCODE) using bedtools intersect. The BED file contains crosslinking-induced truncation sites (CITS) from an iCLIP experiment, given as chromosome locations.

Although it worked well, the attributes column in the hg38 GTF file is a mess.

  1. Separating the attributes so that each one ends up in its own column turned out to be tricky (for me).
  2. Not all the attributes I need are present. For example, I eventually need to show how many 3'UTRs, 5'UTRs, etc. are present.
  3. I tried downloading a specific GTF file from the UCSC browser but cannot get all the RNAs (mRNAs, miRNAs, lncRNAs, etc.) in one file to do the bedtools analysis.
  4. I tried different GTF/GFF parsers, but none seem to work well (or they are difficult to use).

Any suggestions appreciated.

-Learning.

gtf annotations parser attributes bedtools • 2.1k views

Can you show which files you have and what you are trying to get? In general, I wouldn't recommend running bedtools intersect on a GTF file, because bedtools doesn't really understand the relations between features (gene -> transcript -> exon), so your output file might get very messed up. Definitely check the result in a genome browser and verify that all your exons, transcripts, etc. are in place.

5.7 years ago
CrisMar ▴ 80

Yes, I have this bed file:

$ head CITS.bed

chr1    568974  568975  CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]   12   +
chr1    2239149 2239150 CITS_2[gene=chr1_f_c1136][PH=7][PH0=0.40][P=2.21e-04]   7   +
chr1    2239899 2239900 CITS_3[gene=chr1_f_c1138][PH=6][PH0=0.21][P=3.56e-04]   6   +
chr1    2461199 2461200 CITS_4[gene=chr1_f_c1237][PH=5][PH0=0.17][P=1.46e-04]   5   +

And I want to get something like this (a made-up example), with each attribute in its own column:

chr1    568974  568975  CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]   12   +   Gene_ID:EST000000   Gene_name: GeneX  Transcript_name: Transcript X  Feature: 5'UTR
chr1    2239149 2239150 CITS_2[gene=chr1_f_c1136][PH=7][PH0=0.40][P=2.21e-04]   7   + Gene_ID:EST0000001   Gene_name: GeneY  Transcript_name: Transcript Y  Feature: lnRNA
chr1    2239899 2239900 CITS_3[gene=chr1_f_c1138][PH=6][PH0=0.21][P=3.56e-04]   6   + Gene_ID:EST0000002   Gene_name: GeneZ  Transcript_name: Transcript Z  Feature: miRNA001

The BED file contains only RNA reads (mRNAs, miRNAs, lncRNAs, snRNAs). I had originally converted the GTF file into a BED file before using bedtools intersect.

But yes, you are correct: the attributes column of the GTF file (gencode.v28.annotation.hg38.gtf) is really messy:

chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
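
For reference, the attributes column is just a run of `key value;` pairs, so it can be split without a dedicated parser. Below is a minimal Python sketch, not a full GTF parser: the function names `parse_attributes` and `attribute_columns` are made up for illustration, and it assumes the real file is tab-separated with the attributes in the ninth column.

```python
import re

# One `key value;` pair: the value is either double-quoted or a bare token
# (GENCODE writes `level 2;` without quotes).
ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";]+?))\s*;')


def parse_attributes(attr_field):
    """Return the GTF attributes column as a dict of key -> value."""
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(attr_field):
        value = quoted or bare
        # Keys such as `tag` can repeat on one line; keep every value.
        attrs[key] = attrs[key] + "," + value if key in attrs else value
    return attrs


def attribute_columns(gtf_line, wanted):
    """Pick the `wanted` attributes from one GTF line, "NA" when absent."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs = parse_attributes(fields[8])
    return [attrs.get(key, "NA") for key in wanted]


# The gene line quoted above, rebuilt with tabs between the nine columns.
example = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";',
])
print(attribute_columns(example, ["gene_id", "gene_name", "transcript_name"]))
```

Gene lines carry no `transcript_name`, which is why the sketch fills missing attributes with "NA" instead of shifting the columns.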

If you cannot find all the attributes in the GTF downloaded from UCSC, try getting it from GENCODE. If you need some specific features, you can filter them out of the GTF into a new file:

awk '$3=="transcript"{print $0}'
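
On the 5'UTR/3'UTR counts: as far as I know, the GENCODE GTF labels both kinds simply as `UTR` in column 3, so they have to be told apart by their position relative to the CDS of the same transcript and by the strand. Here is a rough Python sketch of that idea. It assumes you have already pulled (transcript_id, feature, start, end, strand) tuples out of the GTF; the function name `classify_utrs` and the toy records are made up for illustration.

```python
from collections import Counter


def classify_utrs(records):
    """Label UTR records as 5'UTR or 3'UTR.

    `records` is a list of (transcript_id, feature, start, end, strand) tuples.
    """
    # First pass: overall CDS span of each transcript.
    cds_span = {}
    for tid, feature, start, end, strand in records:
        if feature == "CDS":
            lo, hi = cds_span.get(tid, (start, end))
            cds_span[tid] = (min(lo, start), max(hi, end))

    # Second pass: compare each UTR with the CDS span of its transcript.
    labelled = []
    for tid, feature, start, end, strand in records:
        if feature != "UTR" or tid not in cds_span:
            continue  # non-coding transcripts have no CDS to compare against
        cds_lo, cds_hi = cds_span[tid]
        if end < cds_lo:
            left_of_cds = True
        elif start > cds_hi:
            left_of_cds = False
        else:
            continue  # UTR overlapping the CDS span: skipped in this sketch
        # On the minus strand the 5' end is at the higher coordinate.
        five_prime = left_of_cds if strand == "+" else not left_of_cds
        labelled.append((tid, start, end, "5'UTR" if five_prime else "3'UTR"))
    return labelled


# Toy records (hypothetical transcripts T1 on "+" and T2 on "-").
toy = [
    ("T1", "CDS", 200, 300, "+"), ("T1", "UTR", 100, 199, "+"),
    ("T1", "UTR", 301, 400, "+"), ("T2", "CDS", 200, 300, "-"),
    ("T2", "UTR", 100, 199, "-"), ("T2", "UTR", 301, 400, "-"),
]
utr_counts = Counter(label for _, _, _, label in classify_utrs(toy))
print(utr_counts)
```

The labelled UTR intervals could then be written out as their own BED file and intersected with the CITS sites.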

If you need the original coordinates from your BED file, try

bedtools intersect -wa

It might still be messy and require more reformatting...
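
For the reformatting step, something along these lines might work. It is only a sketch, and it assumes the intersect was run with both `-wa` and `-wb` (BED file as `-a`, GTF as `-b`), so that every output line holds the 6 BED columns followed by the 9 GTF columns, all tab-separated. The function name `flatten_hit` and the example line are made up for illustration.

```python
import re


def flatten_hit(line, wanted=("gene_id", "gene_name", "transcript_name")):
    """Reformat one `intersect -wa -wb` line: BED columns, attributes, feature."""
    fields = line.rstrip("\n").split("\t")
    bed, gtf = fields[:6], fields[6:15]
    # Only quoted attributes are captured here, which covers the wanted keys.
    attrs = dict(re.findall(r'(\S+) "([^"]*)";', gtf[8]))
    columns = bed + [attrs.get(key, "NA") for key in wanted] + [gtf[2]]
    return "\t".join(columns)


# A BED line and a GTF line from this thread, glued together only to show the
# column layout; these two intervals do not actually overlap.
hit = "\t".join([
    "chr1", "568974", "568975",
    "CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]", "12", "+",
    "chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
    'gene_name "DDX11L1"; transcript_name "RP11-34P13.1-002"; level 2;',
])
print(flatten_hit(hit))
```

This gives one row per overlap, so a site that hits a gene, its transcripts and its exons will show up several times unless the GTF is filtered to one feature type first.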
