Parsing GENCODE GTF to simpler BED files: am I reinventing the wheel?

8.3 years ago

Charles Plessy ★ 2.9k

Dear Biostars,

I am using a shell script to transform a GENCODE GTF file into smaller BED files, which I use to annotate transcriptome (CAGE) data with information such as promoter/intron/exon classification or gene name.

As a reminder, a GENCODE GTF file looks like this:

$ zcat gencode.v23.annotation.gtf.gz | cut -c -80 | head
##description: evidence-based annotation of the human genome (GRCh38), version 2
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2015-07-15
chr1    HAVANA    gene    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; gene_type "trans
chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript
chr1    HAVANA    exon    11869    12227    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    12613    12721    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    13221    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E

The BED files I produce look like this:

$ head gencode.v23.annotation.bed
chr1    11368    12369    promoter    0    +
chr1    11858    11879    boundary    0    +
chr1    11868    12227    exon    0    +
chr1    11868    14409    gene    0    +
chr1    11868    14409    transcribed_unprocessed_pseudogene_DDX11L1    0    +
chr1    11999    12020    boundary    0    +
chr1    12009    12057    exon    0    +
chr1    12046    12067    boundary    0    +
chr1    12168    12189    boundary    0    +
chr1    12178    12227    exon    0    +

$ head gencode.v23.annotation.genes.bed
chr1    11868    14409    DDX11L1    0    +
chr1    14403    29570    WASH7P    0    -
chr1    17368    17436    MIR6859-1    0    -
chr1    29553    31109    RP11-34P13.3    0    +
chr1    30365    30503    MIR1302-2    0    +
chr1    34553    36081    FAM138A    0    -
chr1    52472    53312    OR4G4P    0    +
chr1    62947    63887    OR4G11P    0    +
chr1    69090    70008    OR4F5    0    +
chr1    89294    133723    RP11-34P13.7    0    -
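
For illustration, a file like the genes BED above can be approximated with a short awk filter over the GTF gene records. This is only a minimal sketch, not the script referred to in this post: the function name gtf_genes_to_bed is hypothetical, and it assumes tab-separated GTF with a gene_name attribute in column 9. It converts the 1-based GTF start to a 0-based BED start and writes a fixed score of 0.

```shell
# Hypothetical helper: convert GTF "gene" records (on stdin) to BED6.
gtf_genes_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                        # skip the ## header lines
       $3 == "gene" {
         name = "."                         # placeholder if gene_name is absent
         # extract the value of: gene_name "..."
         if (match($9, /gene_name "[^"]+"/))
           name = substr($9, RSTART + 11, RLENGTH - 12)
         print $1, $4 - 1, $5, name, 0, $7  # GTF start is 1-based, BED is 0-based
       }'
}

# Usage (assumed file name, taken from the listing above):
#   zcat gencode.v23.annotation.gtf.gz | gtf_genes_to_bed | sort -k1,1 -k2,2n
```

Fed the DDX11L1 gene record shown earlier, this should print the first line of the genes file (chr1, 11868, 14409, DDX11L1, 0, +). It does not reproduce the promoter or boundary features, which need additional logic.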

Instead of maintaining a script by myself, I would love to use a widely used, well-tested, well-maintained tool. Is there one you would recommend?

Thanks!

GTF BED GENCODE • 3.3k views

updated 21 months ago by Ram 43k • written 8.3 years ago by Charles Plessy ★ 2.9k

As I just found this old post, I would like to add that I am no longer using this script. Instead, I load the GENCODE file in R and parse it with Bioconductor, using functions of the CAGEr package such as ranges2annot.

3.9 years ago by Charles Plessy ★ 2.9k

Ram · Answer 1 · 2016-01-13

8.3 years ago

Alex Reynolds 35k

You could use the GTF option in BEDOPS convert2bed, or the equivalent wrapper script gtf2bed:

$ convert2bed -i gtf -o bed < foo.gtf > foo.bed
$ gtf2bed < foo.gtf > foo.bed

If you need the columns in a particular order, or only a subset of the BED columns, you can pipe the result to common Unix tools like cut and awk.

$ gtf2bed < foo.gtf | cut -f1-6 > foo.bed6
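
As a further sketch of the same idea, awk can select one feature type before cut trims the columns. This assumes that gtf2bed places the GTF feature type in column 8 of its output, which is worth checking against the BEDOPS documentation for your version. The function name exons_bed6 is hypothetical.

```shell
# Hypothetical helper: keep only exon rows of gtf2bed-style output
# (assumption: column 8 holds the GTF feature type), then trim to BED6.
exons_bed6() {
  awk -F '\t' '$8 == "exon"' | cut -f1-6
}

# Usage:
#   gtf2bed < foo.gtf | exons_bed6 > foo.exons.bed6
```

Swapping "exon" for "gene" or "transcript" selects the other feature types in the same way.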

updated 4.3 years ago by Ram 43k • written 8.3 years ago by Alex Reynolds 35k