Biostar Beta. Not for public use.
Find The Gene Length Of A Big List Of Genes
1
Entering edit mode
6.9 years ago
Floris Brenk • 890
@Floris Brenk9611

Hi all,

I have a list of hunderds of (human) gene transcripts and I would like to know what their average length is. This is an expression list and of the identified transcripts I found that this list of genes is influenced by the RIN value of the starting material. So the reason I would like to have the gene length is so I can investigate whether gene length is more prone or less prone for RNA degradation.

gene length • 4.1k views
1
Entering edit mode

What format do you have your data in?

0
Entering edit mode

Bed file format of the transcription start site

0
Entering edit mode

So you don't have end and exon coordinates? Are these novel transcripts or known transcripts?

0
Entering edit mode

No only the TSS and these are known genes only to start with yes

1
Entering edit mode

Would it then suffice to use the lengths of already annotated transcripts? Then you could download the transcript annotations from ucsc or ensembl as gff by using the transcript ids. If you do not have the transcript ids but only positions you could search all transcripts for overlap with or closest transcripts to your TSS.

1
Entering edit mode

Do you have gene IDs or Symbols? Do you need to know the length of the gene or the (average) length of known transcripts for that gene?

0
Entering edit mode

He sorry for late reply just saw it now. I have the gene offical symbols and would like to know the average lenght of known transcripts of the gene yes.

2
Entering edit mode
6.9 years ago
@Jarretinha148

Well, answer is a little bit involved and uses biopython. First, we need to convert gene symbols do Entrez Gene IDs. I do this by brute force mapping. Get the gene_info from NCBI. Convert the file to some form of delimiter separated values file (I use tab separated values with header).

import csv

# Creates the mapping
with open('./data/Homo_sapiens.gene_info.tsv') as f:
mapping = dict()
syn = row['Synonyms']
syn = syn.split('|')
tag = [row['GeneID'], row['Symbol']]
for s in syn:
mapping[s] = tag
mapping[row['Symbol']] = tag

with open('your_genes.txt') as f, open('mapped.tsv', 'w') as w:
for line in f:
gene_symbol = line.strip()
if gene_symbol in mapping.keys():
writer.writerow(mapping[gene_symbol])
else:


It's important to stress that GeneCards and NCBI not always share the same gene symbol/synonyms. Now, we can proceed to the fun part:

import csv
from Bio import Entrez, SeqIO

def fetch_mRNA_lengths(GeneID):
"""Get all mRNAs for a given GeneID from Entrez Gene and prints its accessions, GIs and lengths"""
try:
handle = Entrez.efetch(db='gene', id=GeneID, retmode='xml')
# The first Entrezgene_locus is for the GRCh38 Primary Assembly
locus = record['Entrezgene_locus'][0]
products = locus['Gene-commentary_products']
for product in products:
# 3 is for mRNA
if product['Gene-commentary_type'] == '3':
accn = product['Gene-commentary_accession']
gi = product['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
handle = Entrez.efetch(db='nucleotide', id=gi, rettype='gb')
print GeneID, accn, gi, len(record)
except:

Entrez.email = "insert.your@email.here"

with open('mapped.tsv') as f:
fetch_mRNA_lengths(str(row['GeneID']))


Have fun!!!