Extracting entries from GTF file
7.2 years ago
EVR ▴ 610

Hi,

I am quite new to bioinformatics. I have a GTF file and a BED file that lists transcript name, start, and end positions. For every transcript's start and end position, I would like to extract all the exons that lie between the start and end coordinates of that transcript. For example, imagine a GTF file like the following:

Scaffold1 cuff transcript 344  540  100 + geneid "cuff_45"
Scaffold1 cuff exon 344  400  100 + geneid "cuff_45"
Scaffold1 cuff exon 484  540  100 + geneid "cuff_45"
Scaffold1 cuff transcript 800  1200  100 + geneid "cuff_46"
Scaffold1 cuff exon 800  928  100 + geneid "cuff_46"
Scaffold1 cuff exon 980  1100  100 + geneid "cuff_46"
Scaffold1 cuff exon 1100  1200  100 + geneid "cuff_46"
Scaffold2 cuff transcript 1 500 1000 - gene_id "cuff_47"
Scaffold2 cuff exon 1 500 1000 - gene_id "cuff_47"

and a BED file like the following:

Scaffold1 344 540

Then I would like to extract the Scaffold1 transcript entry and its exons from the GTF file, like the following:

Scaffold1 cuff transcript 344  540  100 + geneid "cuff_45"
Scaffold1 cuff exon 344  400  100 + geneid "cuff_45"
Scaffold1 cuff exon 484  540  100 + geneid "cuff_45"

Can someone suggest a tool to achieve this?

Thanks in advance.

GTF bed • 4.9k views

Use a Unix utility: grep "Scaffold1" your.gtf > scaf1.gtf

Do you also want a BED file from your.gtf, or do you already have one?


Thanks for your reply. I already tried grep, but it returns every entry that contains Scaffold1, whereas I need only the entries between the transcript start and end. I have updated the GTF example in my question. Kindly take a look and guide me.


There are 7 lines that have Scaffold1 in your example. grep above would get all 7. So instead of all 7, you just want the lines that match the interval in your BED file?


Exactly. I just need to extract all the exons confined within the transcript start and end, as in the example above.

7.2 years ago

Given a GTF file called annotations.gtf with your annotations, and a sorted BED file called intervals.bed that contains your regions of interest, you could do the following with BEDOPS:

$ gtf2bed < annotations.gtf | bedops -e 1 - intervals.bed > answer.bed

The file answer.bed contains BED-formatted annotations from the GTF file, which overlap the provided intervals by one or more bases.
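If BEDOPS is not installed, the same kind of filtering can be roughly sketched with plain awk. This is not the BEDOPS method, just an illustration with some assumptions: the sample files are recreated here in tab-separated form with a standard ninth GTF column layout (the frame column "." is added), the interval file is assumed to use the same 1-based inclusive coordinates as the GTF (as in the question's example; for a true 0-based BED file, compare against start + 1), and records must lie entirely inside an interval rather than overlap it by one base. The output stays in GTF format.

```shell
# Hypothetical sample files; columns are tab-separated.
printf 'Scaffold1\t344\t540\n' > intervals.bed
{
  printf 'Scaffold1\tcuff\ttranscript\t344\t540\t100\t+\t.\tgene_id "cuff_45";\n'
  printf 'Scaffold1\tcuff\texon\t344\t400\t100\t+\t.\tgene_id "cuff_45";\n'
  printf 'Scaffold1\tcuff\texon\t484\t540\t100\t+\t.\tgene_id "cuff_45";\n'
  printf 'Scaffold1\tcuff\ttranscript\t800\t1200\t100\t+\t.\tgene_id "cuff_46";\n'
  printf 'Scaffold1\tcuff\texon\t800\t928\t100\t+\t.\tgene_id "cuff_46";\n'
  printf 'Scaffold2\tcuff\ttranscript\t1\t500\t1000\t-\t.\tgene_id "cuff_47";\n'
} > annotations.gtf

# First file (NR == FNR): remember every interval.
# Second file: print GTF records that lie entirely inside any interval
# on the same sequence (column 4 = start, column 5 = end).
awk -F'\t' '
  /^#/ { next }                       # skip comment/header lines in both files
  NR == FNR { n++; chrom[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0; next }
  {
    for (i = 1; i <= n; i++)
      if ($1 == chrom[i] && $4 + 0 >= lo[i] && $5 + 0 <= hi[i]) { print; break }
  }
' intervals.bed annotations.gtf > subset.gtf
```

With the sample above, subset.gtf holds the cuff_45 transcript and its two exons. The loop checks every interval for every record, which is fine for a few hundred regions but slow for large interval sets, where an indexed or sorted-merge tool is the better choice.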

7.2 years ago
cmdcolin ★ 3.8k

You may like to use tabix. Tabix has a preset that works well with GFF, so I would convert the GTF to GFF (probably using gffread from the Cufflinks package, for example gffread -E merged.gtf -o- > merged.gff3).

Then you can bgzip the sorted output and index it with tabix:

bgzip myfile.sorted.gff
tabix -p gff myfile.sorted.gff.gz

Then you can retrieve the records in a region with:

tabix myfile.sorted.gff.gz Scaffold1:344-540

Note you may also need to sort the GFF. See some additional tips here http://gmod.org/wiki/JBrowse_FAQ#How_do_I_create_a_Tabix_indexed_GFF
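The sorting step might look like the sketch below. It assumes a tab-separated GFF3 file named merged.gff3 (a small hypothetical one is created here so the commands run as written) and writes myfile.sorted.gff, the name used in the bgzip command above. Header lines stay on top and the records are ordered by sequence name, then by numeric start.

```shell
# Hypothetical unsorted input; columns are tab-separated.
{
  printf '##gff-version 3\n'
  printf 'Scaffold2\tcuff\texon\t1\t500\t.\t-\t.\tID=e4\n'
  printf 'Scaffold1\tcuff\texon\t484\t540\t.\t+\t.\tID=e2\n'
  printf 'Scaffold1\tcuff\texon\t344\t400\t.\t+\t.\tID=e1\n'
} > merged.gff3

# Keep "#" header lines first, then sort records by column 1 (sequence)
# and column 4 (start, numeric).
{
  grep '^#' merged.gff3 || true
  grep -v '^#' merged.gff3 | sort -t "$(printf '\t')" -k1,1 -k4,4n
} > myfile.sorted.gff

# With htslib installed, the sorted file can then be compressed and indexed:
#   bgzip myfile.sorted.gff
#   tabix -p gff myfile.sorted.gff.gz
```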
