GTF2BED does not provide meaningful output?
6.4 years ago
c_u ▴ 520

Hi guys,

I have been trying to use gtf2bed to convert a GTF file to BED format, but without success. Running the following command:

$ gtf2bed < GRCh38p5_copy.gtf > foo1.bed

gives the error:

Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF (after removing the comment lines). They are:

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

I also ran the tool on the example GTF file from their own website, but the output BED file it gives me is empty.

Any ideas what could be going wrong?

6.4 years ago

There is a bug with Gencode and Ensembl GTF output where some records (gene-level lines, for example) lack the obligatory transcript_id attribute. One workaround is to add a dummy attribute to those records, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gtf.gz \
    | gunzip -c - \
    | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' - \
    | convert2bed --input=gtf - \
    > output.bed
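
Note that the awk step above also appends the dummy attribute to any "#" header lines still in the stream, since they do not contain transcript_id either. Below is a minimal sketch of a variant that passes those lines through untouched. It assumes header lines start with "#"; the function name add_dummy_transcript_id and the toy input are made up for illustration.

```shell
# Hypothetical helper: pad GTF records that lack transcript_id.
# Reads GTF on stdin, writes GTF on stdout.
add_dummy_transcript_id() {
    awk '
        /^#/            { print; next }   # header/comment lines pass through untouched
        /transcript_id/ { print; next }   # record already carries the attribute
                        { print $0 " transcript_id \"\";" }   # append an empty placeholder
    '
}

# Toy input (made up): one header line, one gene record, one transcript record.
printf '%b\n' \
    '##description: toy example' \
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";' \
    'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";' \
    | add_dummy_transcript_id

# In the pipeline above, the function would take the place of the awk step:
#   ... | gunzip -c - | add_dummy_transcript_id | convert2bed --input=gtf - > output.bed
```

Only the gene record in the toy input should come out with the extra attribute. The header and the transcript record are printed as they went in.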

Another option that doesn't muck with the data is to grab the GFF3 version, where one is available, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz \
    | gunzip -c - \
    | convert2bed --input=gff - \
    > output.bed

Some groups do their own thing: the annotations published by the Arabidopsis consortium deviate from the spec in ways that cause similar parsing problems. Oh well! I seem to have more luck getting GFF3 that follows spec, so I'd look in that direction, maybe.


Hi Alex,

But as I mentioned in my question (the 2 lines I copied from the gtf), the transcript_id is present in the gtf.

Also, what could be the reason that even the example GTF file provided on the website (foo.gtf) generates an empty BED file?


Take another look at your sample file. Not sure what's up with the demo file (I'll look into it) but your sample input does not meet spec.
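
Of the two records quoted in the question, the first (the gene line) has a gene_id but no transcript_id, which would fit the error pointing at line 1. A rough way to list such records is sketched below. It assumes attributes are written as `name "value";`, and the function name report_missing_ids is made up for illustration. The sample records are the ones from the question with most attributes trimmed.

```shell
# Hypothetical helper: list GTF records lacking gene_id or transcript_id.
# Output columns (tab-separated): line number, feature type, missing attribute(s).
report_missing_ids() {
    awk '
        /^#/ { next }   # skip header/comment lines (NR still counts them)
        {
            missing = ""
            if ($0 !~ /gene_id "/)       missing = "gene_id"
            if ($0 !~ /transcript_id "/) missing = (missing == "" ? "" : missing ",") "transcript_id"
            if (missing != "") print NR "\t" $3 "\t" missing
        }
    '
}

# The two records from the question, abbreviated to the relevant attributes.
printf '%b\n' \
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";' \
    'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";' \
    | report_missing_ids
# prints: 1<TAB>gene<TAB>transcript_id
```

On a real file this would be run as `report_missing_ids < GRCh38p5_copy.gtf | head`, to see which feature types are affected before deciding between the dummy-attribute workaround and the GFF3 route.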
