Question

Identifying Genes That Lie Either Side Of Intergenic Regions

0

Entering edit mode

10.9 years ago

ancient_learner ▴ 680

Hi I want to identify the immediate genes that lie either side of the intergenic regions. how to get that list with positions? fo example i have a set of intergenic regions in bed format like below for mouse genome

chr1    3044715    3044756
chr1    3095270    3095341
chr1    3100822    3100894
chr1    3167949    3168020
chr1    3205652    3205723
chr1    3684683    3684752

There should be some gene before start and after end of the intergenic regions i just want to find those genes

ucsc • 3.6k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 10.9 years ago by ancient_learner ▴ 680

2

Entering edit mode

Please clarify your question.

ADD REPLY • link 10.9 years ago by PoGibas 5.1k

score 5 · Answer 1 · 2013-05-21

I assume you have the coordinates for your intergenic regions or atleast you should be able to get these details in bed format and then you can use bedtools for this...

You can use a simple awk script and extend the coordinates as to how far would you like to go and look for a gene on either sides....

bedtools closest -a both.bed -b genes.bed -io > both.nearest.genes.txt

There are more than one way to do this using bedtools... The documentation is quite clear and if you post some of your sample data, you might get more specific help!

Here is another way to achieve the same results

score 2 · Answer 2 · 2013-05-21

Another way to do this is to use the BEDOPS bedmap application to map genes to the domain defined outside your input regions.

First, pre-process your elements to define the bounds of flanking regions. For example, you might define flanking regions as +/- 1kb outside your input regions. Say you start with a BED file called myRegions.bed:

$ more myRegions.bed
chr1    3044715    3044756
chr1    3095270    3095341
chr1    3100822    3100894
chr1    3167949    3168020
chr1    3205652    3205723
chr1    3684683    3684752

You could use a simple awk script to generate your new flanking regions:

$ awk ' \
    BEGIN { \
        pad=1000;
    } \
    { \
        chr=$1; \
        start=$2; \
        stop=$3; \
        print $1"\t"(start - pad)"\t"start"\n"$1"\t"stop"\t"(stop + pad); \
    }' myRegions.bed \
    | sort-bed - \
    > myFlanks.bed

Note that we pass the output from awk to BEDOPS sort-bed, to ensure the output file (myFlanks.bed) is a sorted BED file. Sorting prepares the output for use with BEDOPS tools and allows them to operate efficiently, compared with alternatives.

Now grab some genes and turn them into a BED file. We'll take Ensembl gene annotations for mm10:

$ wget -O - ftp://ftp.ensembl.org/pub/mnt2/release-71/gtf/mus_musculus/Mus_musculus.GRCm38.71.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > mm10.genes.bed

We have two sorted, BED-formatted inputs: our +/- 1kb flanking regions and our gene annotations. We're now ready to use bedmap to do mapping:

$ bedmap --echo --echo-map-range myFlanks.bed mm10.genes.bed > answer.bed

The file answer.bed is a sorted BED file that contains your flanking regions and a semi-colon-delimited list of gene ranges which overlap the flanking region:

[ flank-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ flank-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
...
[ flank-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]

Where no genes overlap a flanking region, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.

If you want the entire gene record — and not just the range — use --echo-map in place of --echo-map-range. Other mapping echo options are available, depending on what you want bedmap to report.

NOTE: You might consider complications of double- or multiple-counting of genes, where flanking regions overlap. In that case, you could use a --merge set operation with BEDOPS bedops to merge overlapping flanking regions, before running bedmap. This would ensure that you only count or label genes once:

$ awk ' \
    BEGIN { \
        pad=1000;
    } \
    { \
        chr=$1; \
        start=$2; \
        stop=$3; \
        print $1"\t"(start - pad)"\t"start"\n"$1"\t"stop"\t"(stop + pad); \
    }' myRegions.bed \
    | sort-bed - \
    | bedops --merge - \
    > myMergedFlanks.bed

A second approach is to use BEDOPS closest-features, which would search for and report the nearest gene to the input region. Let's first sort your regions:

$ sort-bed myRegions.bed > mySortedRegions.bed

Grab the Ensembl mm10 records as described above.

Now we apply closest-features on the sorted regions, to look for the closest Ensembl annotations:

$ closest-features --closest mySortedRegions.bed mm10.genes.bed > answer.bed

The file answer.bed will contain the region element and the nearest gene to that element, either upstream or downstream, with ties given to the upstream element:

[ region-1 ] | [ nearest-gene-1 ]
[ region-2 ] | [ nearest-gene-2 ]
...
[ region-N ] | [ nearest-gene-N ]

If you add the --delim '\t' option to closest-features, you'll get back a sorted BED file. This is useful in that standard output from BEDOPS tools can be piped to other downstream tools that can accept BED-formatted standard input.

All BEDOPS tools are designed to operate in larger pipelines in this manner, which helps reduce file I/O and provide added performance benefits, in addition to speed and memory enhancements in the BEDOPS toolkit.

Ram · Answer 3 · 2013-05-22

A simple option is using SnpEff's "closesExon" command:

$ java -Xmx4g -jar snpEff.jar closestExon -bed GRCh37.70 test.bed 
1    12077    12078    line_1;0,exon_1_11869_12227,ENST00000456328,DDX11L1
1    16096    16097    line_2;150,exon_1_15796_15947,ENST00000423562,WASH7P
1    40260    40261    line_3;4180,exon_1_35721_36081,ENST00000417324,FAM138A
1    63879    63880    line_4;0,exon_1_62948_63887,ENST00000492842,OR4G11P

It also gives you the distance in bases to the exon.

More details here.