Programmatically retrieve the length of a protein
5.7 years ago
Natasha ▴ 40

I have a list of genes and I'm looking for a programmatic way to retrieve the length of each gene. For instance, in UniProt the length of a gene is specified in the Sequences section. Can we query this using a script?

Many thanks for the time and attention.

Edit by @genomax: Since this question is about the length of a protein, the title was edited to reflect that.

gene • 1.9k views

Hello deepamahm.iisc,

You've stated that you want to retrieve the length of a gene. What have you tried? Have you put any effort into thinking about this? The answer to these should be "yes", and you should elaborate a bit on:

  • what your actual goals are
  • what you've tried so far
  • why that is or isn't working

If you don't demonstrate that you've put effort into the problem you're facing, people cannot be expected to help you solve it.

-Vijay


Please provide examples of IDs and the source you want to get this info from.


Many thanks for the response. Sorry for not being specific enough: the gene IDs are UniProt identifiers, for example P35575 and its isoform P35575-1. The source I want to get this data from is UniProt.


Define the length of a gene: is that all exonic regions, or also the introns? If your organism is annotated, try using the annotation files from Ensembl or UCSC. From these GTF files you can extract the coordinates of each gene or exon.


The length that I am referring to is the one specified in UniProt. Please find the snapshot attached.


If you obtain the FASTA record, you can get its length by just counting the characters of the sequence (not the header line). There will always be 'dispute' over what the true length of a gene is, though.

If the NGS protocol is targeting spliced mRNA, then the 'gene length' should be that of the spliced mRNA that is transcribed and then sequenced, etc.
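
The character counting suggested above for a FASTA record could look like the following minimal sketch. The function name `fasta_sequence_length` and the example record are made up for illustration; the record is not a real UniProt entry.

```python
def fasta_sequence_length(fasta_text: str) -> int:
    """Return the number of residues in a single-record FASTA string."""
    length = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        # Skip the '>' header line and blank lines; only sequence lines count.
        if not line or line.startswith(">"):
            continue
        length += len(line)
    return length


# Made-up record for illustration (not a real UniProt entry).
example = """>sp|X00000|EXAMPLE_HUMAN Hypothetical protein
MEEGMNVLHD
FGIQSTHYLQ
VNY
"""
print(fasta_sequence_length(example))  # 23
```

For a protein FASTA this count is the number of amino acids, which is the value UniProt reports as the length.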


Thanks a lot Kevin! I have scripts for querying the FASTA sequences.


Be aware of the distinction that b.nota points out, i.e., there is a difference between:

  • unspliced gene length
  • spliced gene length
  • protein length etc.

The one you choose may be important for downstream analyses.
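
To make the three lengths concrete, here is a small sketch on a toy transcript model. All coordinates are hypothetical (1-based, inclusive), and it assumes a complete coding region that ends in a stop codon.

```python
# Hypothetical transcript model: 1-based, inclusive genomic coordinates.
exons = [(1000, 1200), (5000, 5300), (9000, 9500)]
# Hypothetical coding region (CDS) inside those exons, stop codon included.
cds = [(1100, 1200), (5000, 5300), (9000, 9098)]


def span_length(intervals):
    """Distance from the first start to the last end, introns included."""
    return max(end for _, end in intervals) - min(start for start, _ in intervals) + 1


def summed_length(intervals):
    """Total number of bases covered by the intervals themselves."""
    return sum(end - start + 1 for start, end in intervals)


unspliced = span_length(exons)   # unspliced gene length: exons + introns
spliced = summed_length(exons)   # spliced length: exons only (UTRs included)
cds_nt = summed_length(cds)      # coding nucleotides, stop codon included
protein_aa = cds_nt // 3 - 1     # the stop codon encodes no amino acid

print(unspliced, spliced, cds_nt, protein_aa)  # 8501 1003 501 166
```

In this toy case, multiplying the protein length by 3 gives 498 nt, which matches neither the spliced length (1003) nor the unspliced length (8501).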


Excuse me for the naive question. In microarray experiments, the probe binds to the target gene and the expression levels are obtained. I'm looking for the length of the genes reported in those studies. Currently, I look up the corresponding protein ID in UniProt to obtain the length and convert it to gene length by multiplying it by 3. But I understand from the answers to my post that what I'm doing is wrong. Could you please explain which of the 3 lengths stated above should be computed if I am looking at the genes reported in microarray results?


Be aware that the length shown in your attached figure is not the gene length but the protein length.


I usually download the Ensembl GTF file, read in each gene's start and end coordinates using pd.read_csv, and compute the length from them.

Hope this helps.
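
A rough sketch of the GTF approach described above. It uses the standard-library `csv` module as a dependency-free stand-in for `pd.read_csv`, so it is not the commenter's actual code. The column names follow the usual 9-column GTF layout, and the sample lines and gene IDs are made up.

```python
import csv
import io
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def gene_lengths(gtf_handle):
    """Map gene_id -> genomic span (end - start + 1) for 'gene' features."""
    lengths = {}
    # Drop '#' header lines, then split the remaining lines on tabs.
    data_lines = (line for line in gtf_handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != len(GTF_COLUMNS):
            continue  # skip blank or malformed lines
        rec = dict(zip(GTF_COLUMNS, row))
        if rec["feature"] != "gene":
            continue
        match = re.search(r'gene_id "([^"]+)"', rec["attribute"])
        if match:
            # GTF coordinates are 1-based and inclusive, hence the +1.
            lengths[match.group(1)] = int(rec["end"]) - int(rec["start"]) + 1
    return lengths


# Made-up GTF content for illustration.
sample_gtf = "\n".join([
    "#!genome-build toy",
    "\t".join(["1", "toy", "gene", "1000", "9500", ".", "+", ".",
               'gene_id "GENE_A"; gene_name "A";']),
    "\t".join(["1", "toy", "exon", "1000", "1200", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A1";']),
    "\t".join(["2", "toy", "gene", "300", "799", ".", "-", ".",
               'gene_id "GENE_B"; gene_name "B";']),
]) + "\n"

print(gene_lengths(io.StringIO(sample_gtf)))  # {'GENE_A': 8501, 'GENE_B': 500}
```

With pandas, the same nine columns can be loaded through `pd.read_csv` using `sep="\t"`, `comment="#"` and `header=None`. Either way, the value computed here is the unspliced genomic span (introns included), not a protein length.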

5.7 years ago
GenoMax 141k

You should have said Protein instead of gene to be specific.

https://www.uniprot.org/uniprot/?query=P35575&columns=length&format=tab

will get you

Length
357

Information about programmatic access to UniProt is available on this page.
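
For a whole list of accessions, the same kind of query can be scripted. The sketch below is hedged: the endpoint `https://rest.uniprot.org/uniprotkb/search` and the `fields`/`format=tsv` parameters are assumptions about the current UniProt REST service rather than something taken from this thread, and the function names are hypothetical. The parser reads columns by position and assumes the first is the accession and the second is the length.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumption: current UniProt REST search endpoint (not from this thread).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_length_query(accessions):
    """Build a search URL asking for accession and length as TSV."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    params = {"query": query, "fields": "accession,length", "format": "tsv"}
    return f"{UNIPROT_SEARCH}?{urllib.parse.urlencode(params)}"


def parse_lengths(tsv_text):
    """Parse a two-column TSV (accession, length) that has a header row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return {row[0]: int(row[1]) for row in rows[1:] if len(row) >= 2}


def fetch_lengths(accessions):
    """Query the service (needs network access) and return {accession: length}."""
    with urllib.request.urlopen(build_length_query(accessions), timeout=30) as resp:
        return parse_lengths(resp.read().decode("utf-8"))


# Offline illustration using the value shown in this answer (P35575 -> 357).
print(parse_lengths("Entry\tLength\nP35575\t357\n"))  # {'P35575': 357}
# Live call, left commented out because it needs network access:
# print(fetch_lengths(["P35575"]))
```

The service pages its results, so a long ID list would probably need to be sent in batches.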


Would it be wrong if the length of the protein is multiplied by 3 to obtain the number of base pairs in a gene?


It would be wrong, because you are looking at a eukaryotic protein, where the structure of the gene is complex and includes non-coding elements. If you want the length of the actual gene, you would need to find that information with a different query.


Could you please suggest how the length of the actual gene has to be calculated? Which query has to be used?

5.7 years ago

Use seq_stat.py. It outputs a tab-separated file containing the header, the count of each nucleotide, and the sequence length. If a multi-FASTA file is provided, it prints these statistics for each sequence, with the sequence length in the last column.
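
The source of seq_stat.py is not shown in this thread, so the following is only a sketch of the kind of output described (header, per-nucleotide counts, length in the last column), not that script. The function name `fasta_stats` and the example sequences are made up.

```python
from collections import Counter


def fasta_stats(fasta_text, alphabet="ACGT"):
    """Return one row per record: [header, count per letter..., length]."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))

    rows = []
    for name, seq in records:
        counts = Counter(seq)
        rows.append([name] + [counts[base] for base in alphabet] + [len(seq)])
    return rows


# Made-up multi-FASTA input for illustration.
example = ">seq1\nACGTAC\nGT\n>seq2\nGGGCC\n"
print("\t".join(["header", *"ACGT", "length"]))
for row in fasta_stats(example):
    print("\t".join(map(str, row)))
```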


OP is not starting with a sequence but an ID from UniProt. I don't think this solution is going to work for the original question.
