Gene length table from gtf file for rpkm() function
1
0
Entering edit mode
7.8 years ago
debitboro ▴ 260

Hi Biostars,

I have a RNASeq count table at gene level from HTSeq, and I want to get the gene length table from the used gtf file (from Ensembl) only for the genes in the HTSeq table. Is there anyway or some scripts to do this ?

ENSG00000000003 0 0 1 4 0 0 0 0 0 15 2 11 0
ENSG00000000419 2 2 3 3 14 20 9 13 37 81 13
ENSG00000000460 2 5 7 5 43 11 15 11 14 71 38
....
....

this is my HTSeq table, and I want to get the length only for ENSG00000000003, ENSG00000000419, ENSG00000000460, ...

gene length HTSeq count RPKM edgeR • 5.2k views
ADD COMMENT
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

Hello,

I used the command which is in the README.md file to get the transcript_length from a gtf file like the following:

cat gtf-file | awk -F'\t' '{if($3=="transcript") {split($9,a,";"); print a[1]"\t"$5-$4};}' | sed 's/[transcript_id |"|]//g' > transcript_length.txt

For Eg: Let's say I have the transcript transcript_002 with start and end positions (42939 49164) with the above command, I got the following output:

transcript_id    length
transcript_002    6225

But when I used another command which is from here https://gist.github.com/sp00nman/10372555

awk -F"\t" '$3=="transcript" {ID=substr($9, length($9)-16, 15); L[ID]+=$5-$4+1}END{for(i in L){print i"\t"L[i]}}' gtf-file

I got another output:

transcript_id    length
transcript_002    6226

Which is the right one?

ADD REPLY
0
Entering edit mode

Hi,

The first command considers the transcript length without taking into account first and the last base. Second one included that by adding +1 to it. In my opinion second one makes sense. Though the first command won't affect the results significantly.

ADD REPLY

Login before adding your answer.

Traffic: 2578 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6