Question

Extract random number of transcript and exons from gencode annotation

0

Entering edit mode

6.2 years ago

samuel ▴ 240

Hi, I want to extract say 1000 transcripts and related exons from my gencode.gtf. Does anyone know how to do this? I'm a beginner and have used grep to extract 'transcripts' and 'exons' only but as there are not a set number of exons connected to the transcripts I'm not sure how to count to extract 1000 transcripts??

At the moment I have:

 grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript" || $3=="exon")print$0}'

Many thanks!

genome sequencing • 2.0k views

ADD COMMENT • link updated 6.2 years ago by michael.ante ★ 3.8k • written 6.2 years ago by samuel ▴ 240

score 1 · Answer 1 · 2018-01-30

Okay kind of worked out a dirty way of doing it so any more elegant solutions welcome. I found the id of the 1000th transcript and then used sed to print out everything before this id:

> grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript")print$0}'  > test.gtf
## to get the id of the 1000th transcript
 > cat test.gtf | sed -n -e '1000p'
### print everything from the first line to the 1000th transcript id
> sed -n '1,/<TRANSCRIPT_ID FROM THE PREVIOUS LINE OF CODE>/ p' test.gtf

score 0 · Answer 2 · 2018-01-31

Hi Zoegward,

I think you need to extract 1000 transcript IDs from the gtf. Something like

awk '$3=="transcript" && match ($0, /protein_coding/){for(i=1;i<NF;i++){if(match($i,/transcript_id/)){print $(i+1)}}}' gencode.v27.gtf > transcript-list.txt

To extract all Transcript IDs from protein coding genes. In order to get 1000 IDs just uses head -n 1000, tail -n 1000, and if you want to shuffle it shuf it before using tail /head.

The resulting list is used for a grep search:

grep -f 1000-transcript-ids.txt gencode.v27.gtf > out.gtf

Cheers,

Michael