Gene length database
5
2
Entering edit mode
9.6 years ago

Hi,

I'm trying to do a pathways/enrichment analysis and in order to do so, I'm trying to divide the prevalence of a variant in a gene by the length of the gene. Is there a database online where I can easily download gene lengths? I'm working with gene symbols (i.e. KCNQ2) so anything that matched the gene symbol to a length would be preferable.

Thanks!

Database ncbi gene-length • 12k views
ADD COMMENT
3
Entering edit mode
9.6 years ago

Using mysql/ucsc:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select X.geneSymbol,K.name as transcript, K.chrom,K.txEnd-K.txStart as transcriptLen from kgXref as X,knownGene as K where X.geneSymbol="KCNQ2" and X.kgId=K.name '
+------------+------------+-------+---------------+
| geneSymbol | transcript | chrom | transcriptLen |
+------------+------------+-------+---------------+
| KCNQ2      | uc002yex.3 | chr20 |          1804 |
| KCNQ2      | uc002yey.1 | chr20 |         66452 |
| KCNQ2      | uc002yez.1 | chr20 |         66452 |
| KCNQ2      | uc002yfa.1 | chr20 |         66452 |
| KCNQ2      | uc002yfb.1 | chr20 |         66452 |
| KCNQ2      | uc002yfc.1 | chr20 |         38962 |
| KCNQ2      | uc011aax.1 | chr20 |         48707 |
+------------+------------+-------+---------------+
ADD COMMENT
0
Entering edit mode

Hi Pierre, sorry for my stupid question. Here, you showed an example with single gene 'KCNQ2", right? Could you please tell me how to do this with a batch of gene, for example, if I have a file with the gene symbols? Can I put the mysql command in a loop to iterate over the gene symbols?

Thanks

ADD REPLY
0
Entering edit mode

remove the statement ' X.geneSymbol="KCNQ2" and'

ADD REPLY
1
Entering edit mode
9.6 years ago

You might look up the RefSeq accession and version name:

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/gene_RefSeqGene
...
$ grep 'KCNQ2' gene_RefSeqGene
9606    3785    KCNQ2   NG_009004.1

Then look up the alignment:

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/GCF_000001405.26_refseqgene_alignments.gff3
...
$ grep 'NG_009004.1' GCF_000001405.26_refseqgene_alignments.gff3
NC_000020.11    RefSeq  match   63404189        63477640        73452   +       .       ID=aln1679;Target=NG_009004.1 1 73452 -;bit_score=136701;e_value=0;gap_count=0;num_ident=73452;num_mismatch=0;pct_coverage=100;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;score=73452

The accession NC_000020.11 tells you KCNQ2 is mapped to chromosome 20 at bases 63404189 to 63477640. Note that this alignment is to assembly hg38, which you may or may not be using:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr20%3A63404189-63477640

ADD COMMENT
0
Entering edit mode

Is it possible to do this in python?

ADD REPLY
0
Entering edit mode

There is a library to wget in python. You can do regular expressions in python with re. Pierre's answer is better, probably.

ADD REPLY
0
Entering edit mode

Thank you for the code! I have a problem and that is when I get the file: wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/GCF_000001405.25_refseqgene_alignments.gff3 there are only 7750 GeneRef Accessions. I'm a bit confused about that because I need the length of all of the genes and not 7750 of them. Is there any way to find the length of all of the genes? Thanks!

ADD REPLY
0
Entering edit mode
9.6 years ago

I've not seen a convenient database of this (this isn't usually stored explicitly), but you can determine this given an annotation file easily enough. Here's an example of how to do that (and a bit more) in R: Normalization Of Rna Sequencing Counts (By Ercc / Gene Length)

ADD COMMENT
0
Entering edit mode
ADD COMMENT
0
Entering edit mode
7.8 years ago
anp375 ▴ 180

This was a long, complicated, and tiring process. Getting the variant density is annoying as well if you don't know how to use R very efficiently. I'm just about done with it. If I'm allowed to later, I will post the methods up. However, if you don't care about distinguishing regions that are intron/exon/UTR/coding/etc., then just look at the gtf. Extract the whole genes. If you want a specific region, you'll have to extract that type and concatenate the results from the gtf. If you're dealing with targeted sequencing, then the process is more complicated.

ADD COMMENT
0
Entering edit mode

Hi, If you don't mind how can I extract the UTRs sequences using R?

ADD REPLY

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6