Question

distance between two genes

0

8.8 years ago

mika6891 • 0

Hi,

Is there a place where I can retrieve the distance between two genes on the same chromosome? I have a list of 100 genes, so it would be nice to retrieve this information from a database.

genome • 7.5k views

updated 16 months ago by Ram 43k • written 8.8 years ago by mika6891 • 0

0

Care to give some examples?

8.8 years ago by 5heikki 11k

Ram · Answer 1 · 2015-06-23

4

8.8 years ago

abascalfederico ★ 1.2k

You could obtain the coordinates of those genes and then do some simple arithmetic: max(start_gene1, start_gene2) - min(end_gene1, end_gene2). This assumes that start is always the lower coordinate, regardless of strand orientation (for a gene on the minus strand, end is the real start of the gene). If the genes overlap, you will get a negative number.
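As a minimal sketch of that arithmetic in Python (the function name and the coordinates are made up for illustration, and both genes are assumed to be on the same chromosome):

```python
def gene_distance(start1, end1, start2, end2):
    """Gap between two genes on the same chromosome.

    Assumes start <= end for each gene, ignoring strand.
    A negative result means the two genes overlap.
    """
    return max(start1, start2) - min(end1, end2)


# Hypothetical coordinates, for illustration only.
print(gene_distance(100, 200, 300, 400))  # 100 (separate genes)
print(gene_distance(100, 250, 200, 400))  # -50 (overlapping genes)
```

The order in which the two genes are passed does not matter, since max and min are symmetric.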

updated 4.5 years ago by Ram 43k • written 8.8 years ago by abascalfederico ★ 1.2k

0

This method may not work if the genes have many exons, which means that one gene may have many start points and end points.
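One possible way around this, assuming the coordinates come as exon-level records, is to collapse each gene's exons into a single span (lowest start, highest end) before taking the distance. A small sketch with made-up exon coordinates and a hypothetical helper name:

```python
def gene_span(exons):
    """Collapse a list of (start, end) exon intervals into one (start, end) gene span."""
    starts, ends = zip(*exons)
    return min(starts), max(ends)


# Hypothetical exon coordinates for two genes on the same chromosome.
exons_a = [(100, 150), (300, 380), (500, 620)]
exons_b = [(900, 950), (1200, 1300)]

span_a = gene_span(exons_a)  # (100, 620)
span_b = gene_span(exons_b)  # (900, 1300)

# Same arithmetic as for whole genes, applied to the collapsed spans.
gap = max(span_a[0], span_b[0]) - min(span_a[1], span_b[1])
print(span_a, span_b, gap)  # (100, 620) (900, 1300) 280
```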

8.2 years ago by syrttgump ▴ 50

Ram · Answer 2 · 2015-06-23

You are in luck: I just wrote a short script to do that with olfactory receptor genes from the mouse genome. I got some fused genes, since they are close together and have similar sequences.

1. I use the UCSC browser to download the mm10 mouse genome Ensembl gene table as a bed file, and subset it using grep -f listOlfrGene.txt, where listOlfrGene.txt contains Ensembl transcript IDs gathered from BioMart (based on a GO term search for olfactory receptor function).
2. The subset bed file is then sorted using bedtools sort bedFile > olfr_genes_sorted.bed (http://bedtools.readthedocs.org/en/latest/index.html).
3. I run bedtools closest -s -d -io -N -a olfr_genes_sorted.bed -b olfr_genes_sorted.bed > output.bed. This gives me a new bed file in the format gene #1 bed data | closest gene #2 bed data | distance between #1 and #2. Here the closest gene has to be distinct, on the same strand of the same chromosome, and not overlapping (-s -d -io -N options, read the manual).
4. This file is simplified by running awk '{print $NF,"\t",$1,"\t",$4,"\t",$10}' output.bed > closestOlfrGenes.txt to get the data in the distance | chromosome | geneID #1 | geneID #2 format (which I find more convenient).
5. sort -n closestOlfrGenes.txt | awk '$1 > 0 {print $0}' > sortedClosestOlfrGenes.txt gives me the values sorted by distance. I use the awk part to get rid of a couple of values that were at -10 for some reason.
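If bedtools is not at hand, the closest-gene step above can be roughly approximated in plain Python. This is only a sketch under assumptions: the function name and gene records are hypothetical, it compares every pair within a chromosome/strand group (fine for a few hundred genes, slow for whole genomes), and its distance convention may differ slightly from what bedtools reports.

```python
from collections import defaultdict


def closest_same_strand(genes):
    """For each gene, find the nearest distinct, non-overlapping gene
    on the same chromosome and strand.

    genes: list of (chrom, start, end, name, strand) tuples.
    Returns {name: (closest_name, distance)}; genes without any
    qualifying neighbour are left out.
    """
    groups = defaultdict(list)
    for chrom, start, end, name, strand in genes:
        groups[(chrom, strand)].append((start, end, name))

    result = {}
    for members in groups.values():
        for start, end, name in members:
            best = None
            for o_start, o_end, o_name in members:
                if o_name == name:
                    continue  # skip the gene itself
                dist = max(start, o_start) - min(end, o_end)
                if dist < 0:
                    continue  # skip overlapping genes
                if best is None or dist < best[1]:
                    best = (o_name, dist)
            if best is not None:
                result[name] = best
    return result


# Hypothetical records, for illustration only.
genes = [
    ("chr1", 100, 200, "geneA", "+"),
    ("chr1", 500, 600, "geneB", "+"),
    ("chr1", 250, 300, "geneC", "-"),
    ("chr1", 650, 900, "geneD", "+"),
    ("chr2", 100, 200, "geneE", "+"),
]
closest = closest_same_strand(genes)
for name in sorted(closest):
    print(name, closest[name])
```

In this toy input geneC and geneE have no neighbour on their strand and chromosome, so they do not appear in the output.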

A sample from each file is available here: http://pastebin.com/dMh7MQUU. Note that the end result contains paired lines in this format: distanceX gene1 gene2 \n distanceX gene2 gene1 \n
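If the reciprocal lines get in the way, one option is to keep only the first line of each pair. A small Python sketch, assuming rows in the distance | chromosome | geneID #1 | geneID #2 layout (the function name and the rows are made up):

```python
def unique_pairs(rows):
    """Keep one row per unordered gene pair.

    rows: iterable of (distance, chrom, gene1, gene2) tuples.
    The first occurrence of each pair is kept, in input order.
    """
    seen = set()
    kept = []
    for dist, chrom, gene1, gene2 in rows:
        key = (dist, frozenset((gene1, gene2)))  # order of the two genes is ignored
        if key not in seen:
            seen.add(key)
            kept.append((dist, chrom, gene1, gene2))
    return kept


# Hypothetical rows, for illustration only.
rows = [
    (50, "chr1", "geneB", "geneD"),
    (50, "chr1", "geneD", "geneB"),
    (300, "chr1", "geneA", "geneB"),
]
deduplicated = unique_pairs(rows)
print(deduplicated)
```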

For visualization, with the results here (http://imgur.com/caNxDew):

# Read the sorted distance table (whitespace-separated, no header)
closestOlfr <- read.csv(file = "sortedClosestOlfrGenes.txt", sep = "", header = FALSE, na.strings = ".", col.names = c("dist", "chr", "gene", "closest"))
closestOlfr$dist <- closestOlfr$dist / 1000 # convert bp to kb
# Histogram of the distances up to 100 kb
h <- hist(closestOlfr$dist[closestOlfr$dist <= 100], breaks = 100, col = "red", xlab = "Distance to closest olfactory gene (kb)", main = "Relative proximity of olfactory genes (cut-off at 100kb)")
dist_wanted <- 20 # threshold in kb
# Count the genes whose closest neighbour lies within the threshold
print(c("For this threshold (kb):", dist_wanted, "here is the number of close genes", sum(closestOlfr$dist <= dist_wanted)))

I conclude that a maximum intron size of 25000 for RNA-Seq alignment is still too high.

score 3 · Answer 3 · 2015-06-23

3

8.8 years ago

Anima Mundi ★ 2.9k

Hello, you can download the genomic coordinates of your genes (e.g. from BioMart), sort the list according to chromosomal location and then measure the distances via scripting.
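A minimal Python sketch of that sort-then-measure idea, assuming the coordinates have already been downloaded and loaded as tuples (the function name and records below are hypothetical):

```python
from itertools import groupby


def adjacent_distances(genes):
    """Sort genes by chromosome and position, then measure the gap
    between each gene and the next one on the same chromosome.

    genes: iterable of (chrom, start, end, name) tuples.
    Returns a list of (chrom, name1, name2, distance);
    a negative distance means the two genes overlap.
    """
    ordered = sorted(genes, key=lambda g: (g[0], g[1], g[2]))
    distances = []
    for chrom, group in groupby(ordered, key=lambda g: g[0]):
        group = list(group)
        for (_, s1, e1, n1), (_, s2, e2, n2) in zip(group, group[1:]):
            distances.append((chrom, n1, n2, max(s1, s2) - min(e1, e2)))
    return distances


# Hypothetical records, for illustration only.
genes = [
    ("chr2", 50, 80, "geneE"),
    ("chr1", 500, 600, "geneB"),
    ("chr1", 100, 200, "geneA"),
    ("chr1", 550, 700, "geneC"),
]
gaps = adjacent_distances(genes)
for row in gaps:
    print(row)
```

This only compares neighbours in sorted order. For the distance between two arbitrary genes from the list, the same max/min arithmetic can be applied to that pair directly.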