Question

GTF2BED does not provide meaningful output?



6.4 years ago

c_u ▴ 520

Hi guys,

I have been trying to use gtf2bed to convert a GTF file to BED format, but to no avail. Running the following command:

$ gtf2bed < GRCh38p5_copy.gtf > foo1.bed

gives the error:

Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF (with the commented lines removed). They are:

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

Also, I tried to run the tool on the example gtf file mentioned on their own website but the output bed file it gives me is empty.

Any ideas what could be going wrong?

gtf bed RNA-Seq • 2.8k views

ADD COMMENT • link updated 6.4 years ago by Alex Reynolds 35k • written 6.4 years ago by c_u ▴ 520

score 2 · Accepted Answer · 2017-11-30



6.4 years ago

Alex Reynolds 35k

There is a bug with Gencode and Ensembl GTF output where some records (such as the gene line in your sample) lack the obligatory transcript_id attribute. One solution is to add a dummy attribute, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gtf.gz \
    | gunzip -c - \
    | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' - \
    | convert2bed --input=gtf - \
    > output.bed
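
To see what the awk step does, here is a minimal sketch that applies the same filter to shortened versions of the two records from the question. The attributes are trimmed for brevity, the fields are assumed to be tab-delimited, and the file names sample.gtf and patched.gtf are placeholders. The gene record gains an empty transcript_id, while the transcript record passes through unchanged.

```shell
# Write two shortened, tab-delimited GTF records (placeholder file name).
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n' > sample.gtf
printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";\n' >> sample.gtf

# Same filter as above: append a dummy transcript_id only where none exists.
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' sample.gtf > patched.gtf

# Count the records that now carry the attribute (both of them).
grep -c 'transcript_id' patched.gtf
```

The patched file could then be fed to convert2bed --input=gtf in the same way as the pipeline above.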

Another option that doesn't muck with the data is to grab the GFF3 version, where available, and convert that instead, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz \
    | gunzip -c - \
    | convert2bed --input=gff - \
    > output.bed

Some groups do their own thing: annotations published by the Arabidopsis consortium deviate from the spec and cause similar parsing problems. Oh well! I seem to have more luck getting GFF3 that follows spec, so I'd look in that direction, maybe.

ADD COMMENT • link 6.4 years ago by Alex Reynolds 35k



Hi Alex,

But as I mentioned in my question (the 2 lines I copied from the gtf), the transcript_id is present in the gtf.

Also, what could be the reason that even the example GTF file provided on the website (foo.gtf) generates an empty BED file?

ADD REPLY • link 6.4 years ago by c_u ▴ 520



Take another look at your sample file: the first record (the gene line) has no transcript_id. Not sure what's up with the demo file (I'll look into it), but your sample input does not meet spec.
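
One way to check a file is to count, per feature type, the records that carry no transcript_id attribute. A rough sketch, assuming a tab-delimited GTF; the file name check.gtf and its trimmed records are placeholders standing in for the real annotation:

```shell
# Placeholder input: shortened versions of the two records from the question.
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";\n' > check.gtf
printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";\n' >> check.gtf

# Skip comment lines, then tally column 3 (feature type) for records
# whose attribute column (9) contains no transcript_id.
awk -F'\t' '!/^#/ && $9 !~ /transcript_id/ { n[$3]++ } END { for (f in n) print f, n[f] }' check.gtf > missing.txt
cat missing.txt
```

For this two-record input the tally lists only the gene feature, which is the record the converter complains about.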

ADD REPLY • link 6.4 years ago by Alex Reynolds 35k